Cutadapt 4.3 runs excruciatingly long #694

uniqueg · 2023-04-11T23:11:26Z

Problem description

Our CI pipeline runtime recently (and suddenly) increased from ~3 min to more than 4h. We were able to identify an interal Cutadapt call as the culprit, with each individual call taking more than 1h. We did not observe a functional deterioration, but I am not 100% sure that our CI pipeline would necessarily pick that up.

The problem occurs only for the most recent version (4.3), not with versions 4.1 and 4.2. Indeed, capping the range of the supported Cutadapt version to <=4.2 restored our CI runtime back to the typical ~3 min.

Steps to reproduce

Here is the offending call:

cutadapt -a A{100} -o out.fastq FILE

where FILE is, for example, this tiny test file.

Software information

Python version: 3.10.0; conda-forge build h12debd9_5
Cutadapt version: 4.3; bioconda buid py310h1425a21_0
Installed via: Conda/Mamba

Additional info

In our CI pipeline, the problem occurs for Python versions 3.7, 3.8 and 3.9 as well. Installation via Pip was not tested. When recreating the issue locally, one of my cores was running at 100% speed, with very little memory consumption. I did not wait for the command to conclude (my laptop got hot) and see if the output is different from that obtained for older versions (apologies!).

Looking at the changes introduced in 4.3, my best bet would be on #663 possibly being the cause of this issue.

The text was updated successfully, but these errors were encountered:

rhpvorderman · 2023-04-12T07:01:35Z

As the author of #663 I can say that this is unexpected behaviour. I did expect that some scenarios might cause slowdowns, but not this radical.

What is the typical length of your reads? (So I can test on a similar file).

Also what is the purpose of your cutadapt call? To me it looks like poly-A trimming. Is that correct?
As a more permanent-fix you could run with --no-indels, this will disable the heuristic. In the case of AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA etc. it shouldn't matter much if indels are allowed or not. Also cutadapt will use a faster algorithm to find matches without indels IIRC.

@marcelm I see -a A{100} is the recommended way as per the documentation here: https://cutadapt.readthedocs.io/en/stable/recipes.html?highlight=poly#trim-poly-a-tails. Would adding --no-indels there be appropriate?

marcelm · 2023-04-12T07:37:03Z

Hi @uniqueg, thanks a lot for the report.

I was able to reproduce the problem even with an empty FASTQ file (touch empty.fastq) and with a shorter adapter -a A{70}. The extra runtime is spent on minimize_kmer_search_list and find_optimal_kmers, which were indeed added in #663.

As the author of #663 I can say that this is unexpected behaviour. I did expect that some scenarios might cause slowdowns, but not this radical.
* What is the typical length of your reads? (So I can test on a similar file).

@uniqueg provided a test file, but as I mentioned, you can just use an empty FASTQ file; the slow part is the construction of the k-mer finder.

Also what is the purpose of your cutadapt call? To me it looks like poly-A trimming. Is that correct? As a more permanent-fix you could run with --no-indels, this will disable the heuristic. In the case of AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA etc. it shouldn't matter much if indels are allowed or not. Also cutadapt will use a faster algorithm to find matches without indels IIRC.

Using --no-indels doesn’t help. The k-mer heuristic is only disabled when you combine --no-indels with anchored adapters.

@marcelm I see -a A{100} is the recommended way as per the documentation here: https://cutadapt.readthedocs.io/en/stable/recipes.html?highlight=poly#trim-poly-a-tails. Would adding --no-indels there be appropriate?

Perhaps. I don’t remember exactly right now, but I think --no-indels doesn’t give that much of a speedup for regular (non-anchored) adapters since it just uses the normal algorithm but with very high indel costs. I think it would be a good idea to add a special poly-A trimming option, though. That could be much faster than going through the adapter-trimming function.

rhpvorderman · 2023-04-12T07:54:42Z

The extra runtime is spent on minimize_kmer_search_list and find_optimal_kmers, which were indeed added in #663.

Oh dear. That makes sense. That uses a brute-force approach. For the illumina adapter this is only a small cost, but it gets much worse when the adapter becomes longer.

I think these algorithms can be rewritten. Because the kmer finder matches multiple patterns simultaneously, there is no need to optimize for the least amount of patterns. The only reason to do so is to minimize the number of false positives.
Let me see if I can rewrite it. If the number of false positives increases slightly this is not necessarily problematic.

marcelm · 2023-04-12T07:59:50Z

Thanks, great that you’re going to look into this. I got so far as to add this test, feel free to copy it:

@pytest.mark.timeout(0.5)
def test_create_positions_and_kmers_slow():
    create_positions_and_kmers(
        "A" * 100,
        min_overlap=3,
        error_rate=0.1,
        back_adapter=True,
        front_adapter=False,
        internal=True,
    )

uniqueg · 2023-04-12T08:23:47Z

Thanks everyone for the feedback. Hope you can find a fix a fix for this that still retains most/all of the advantages that the feature provides, @rhpvorderman.

Use case is indeed adapter trimming. Having an explicit option may indeed be nice to have :)

Also thanks to all developers for the great tool 🙏

rhpvorderman · 2023-04-12T10:05:58Z

Created a fix and included the test code provided by @marcelm.
The resulting code produces slightly inferior results in terms of adapter matching but this is offset by the hugely better runtime characteristics for very long adapters.

marcelm · 2023-04-12T10:34:39Z

That was very quick, thanks!

Created a fix and included the test code provided by @marcelm. The resulting code produces slightly inferior results in terms of adapter matching but this is offset by the hugely better runtime characteristics for very long adapters.

But false positives for the heuristic just mean that the full algorithm is run more often, right? So the results shouldn’t change.

rhpvorderman · 2023-04-12T10:35:31Z

Exactly, the result doesn't change. It just takes very slightly longer to get there.

marcelm · 2023-04-12T14:18:12Z

This issue motivated me sufficiently to finally implement proper poly-A trimming, so I just added option --poly-a. This is in my tests more accurate and about eight times (!) faster than using -a A{100}. I plan to release Cutadapt 4.4 soon with this new feature (and other fixes).

At the moment, --poly-a is run after adapter trimming, which appeared to make the most sense IMO, but please let me know if you think a different order would be good.

This was referenced Apr 11, 2023

GitHub Actions takes hours zavolanlab/htsinfer#120

Closed

ci: downgrade Cutadapt & silence warning zavolanlab/htsinfer#119

Merged

rhpvorderman mentioned this issue Apr 12, 2023

Fix exponential runtime for longer adapters #695

Merged

marcelm closed this as completed in #695 Apr 12, 2023

marcelm mentioned this issue Apr 12, 2023

Add poly-A trimming #696

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cutadapt 4.3 runs excruciatingly long #694

Cutadapt 4.3 runs excruciatingly long #694

uniqueg commented Apr 11, 2023

rhpvorderman commented Apr 12, 2023

marcelm commented Apr 12, 2023 •

edited

rhpvorderman commented Apr 12, 2023

marcelm commented Apr 12, 2023

uniqueg commented Apr 12, 2023

rhpvorderman commented Apr 12, 2023

marcelm commented Apr 12, 2023

rhpvorderman commented Apr 12, 2023

marcelm commented Apr 12, 2023 •

edited

Cutadapt 4.3 runs excruciatingly long #694

Cutadapt 4.3 runs excruciatingly long #694

Comments

uniqueg commented Apr 11, 2023

Problem description

Steps to reproduce

Software information

Additional info

rhpvorderman commented Apr 12, 2023

marcelm commented Apr 12, 2023 • edited

rhpvorderman commented Apr 12, 2023

marcelm commented Apr 12, 2023

uniqueg commented Apr 12, 2023

rhpvorderman commented Apr 12, 2023

marcelm commented Apr 12, 2023

rhpvorderman commented Apr 12, 2023

marcelm commented Apr 12, 2023 • edited

marcelm commented Apr 12, 2023 •

edited

marcelm commented Apr 12, 2023 •

edited