Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cutadapt 4.3 runs excruciatingly long #694

Closed
uniqueg opened this issue Apr 11, 2023 · 9 comments · Fixed by #695
Closed

Cutadapt 4.3 runs excruciatingly long #694

uniqueg opened this issue Apr 11, 2023 · 9 comments · Fixed by #695

Comments

@uniqueg
Copy link

uniqueg commented Apr 11, 2023

Problem description

Our CI pipeline runtime recently (and suddenly) increased from ~3 min to more than 4h. We were able to identify an interal Cutadapt call as the culprit, with each individual call taking more than 1h. We did not observe a functional deterioration, but I am not 100% sure that our CI pipeline would necessarily pick that up.

The problem occurs only for the most recent version (4.3), not with versions 4.1 and 4.2. Indeed, capping the range of the supported Cutadapt version to <=4.2 restored our CI runtime back to the typical ~3 min.

Steps to reproduce

Here is the offending call:

cutadapt -a A{100} -o out.fastq FILE

where FILE is, for example, this tiny test file.

Software information

  • Python version: 3.10.0; conda-forge build h12debd9_5
  • Cutadapt version: 4.3; bioconda buid py310h1425a21_0
  • Installed via: Conda/Mamba

Additional info

In our CI pipeline, the problem occurs for Python versions 3.7, 3.8 and 3.9 as well. Installation via Pip was not tested. When recreating the issue locally, one of my cores was running at 100% speed, with very little memory consumption. I did not wait for the command to conclude (my laptop got hot) and see if the output is different from that obtained for older versions (apologies!).

Looking at the changes introduced in 4.3, my best bet would be on #663 possibly being the cause of this issue.

@rhpvorderman
Copy link
Collaborator

As the author of #663 I can say that this is unexpected behaviour. I did expect that some scenarios might cause slowdowns, but not this radical.

  • What is the typical length of your reads? (So I can test on a similar file).

Also what is the purpose of your cutadapt call? To me it looks like poly-A trimming. Is that correct?
As a more permanent-fix you could run with --no-indels, this will disable the heuristic. In the case of AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA etc. it shouldn't matter much if indels are allowed or not. Also cutadapt will use a faster algorithm to find matches without indels IIRC.

@marcelm I see -a A{100} is the recommended way as per the documentation here: https://cutadapt.readthedocs.io/en/stable/recipes.html?highlight=poly#trim-poly-a-tails. Would adding --no-indels there be appropriate?

@marcelm
Copy link
Owner

marcelm commented Apr 12, 2023

Hi @uniqueg, thanks a lot for the report.

I was able to reproduce the problem even with an empty FASTQ file (touch empty.fastq) and with a shorter adapter -a A{70}. The extra runtime is spent on minimize_kmer_search_list and find_optimal_kmers, which were indeed added in #663.

As the author of #663 I can say that this is unexpected behaviour. I did expect that some scenarios might cause slowdowns, but not this radical.

* What is the typical length of your reads? (So I can test on a similar file).

@uniqueg provided a test file, but as I mentioned, you can just use an empty FASTQ file; the slow part is the construction of the k-mer finder.

Also what is the purpose of your cutadapt call? To me it looks like poly-A trimming. Is that correct? As a more permanent-fix you could run with --no-indels, this will disable the heuristic. In the case of AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA etc. it shouldn't matter much if indels are allowed or not. Also cutadapt will use a faster algorithm to find matches without indels IIRC.

Using --no-indels doesn’t help. The k-mer heuristic is only disabled when you combine --no-indels with anchored adapters.

@marcelm I see -a A{100} is the recommended way as per the documentation here: https://cutadapt.readthedocs.io/en/stable/recipes.html?highlight=poly#trim-poly-a-tails. Would adding --no-indels there be appropriate?

Perhaps. I don’t remember exactly right now, but I think --no-indels doesn’t give that much of a speedup for regular (non-anchored) adapters since it just uses the normal algorithm but with very high indel costs. I think it would be a good idea to add a special poly-A trimming option, though. That could be much faster than going through the adapter-trimming function.

@rhpvorderman
Copy link
Collaborator

The extra runtime is spent on minimize_kmer_search_list and find_optimal_kmers, which were indeed added in #663.

Oh dear. That makes sense. That uses a brute-force approach. For the illumina adapter this is only a small cost, but it gets much worse when the adapter becomes longer.

I think these algorithms can be rewritten. Because the kmer finder matches multiple patterns simultaneously, there is no need to optimize for the least amount of patterns. The only reason to do so is to minimize the number of false positives.
Let me see if I can rewrite it. If the number of false positives increases slightly this is not necessarily problematic.

@marcelm
Copy link
Owner

marcelm commented Apr 12, 2023

Thanks, great that you’re going to look into this. I got so far as to add this test, feel free to copy it:

@pytest.mark.timeout(0.5)
def test_create_positions_and_kmers_slow():
    create_positions_and_kmers(
        "A" * 100,
        min_overlap=3,
        error_rate=0.1,
        back_adapter=True,
        front_adapter=False,
        internal=True,
    )

@uniqueg
Copy link
Author

uniqueg commented Apr 12, 2023

Thanks everyone for the feedback. Hope you can find a fix a fix for this that still retains most/all of the advantages that the feature provides, @rhpvorderman.

Use case is indeed adapter trimming. Having an explicit option may indeed be nice to have :)

Also thanks to all developers for the great tool 🙏

@rhpvorderman
Copy link
Collaborator

Created a fix and included the test code provided by @marcelm.
The resulting code produces slightly inferior results in terms of adapter matching but this is offset by the hugely better runtime characteristics for very long adapters.

@marcelm
Copy link
Owner

marcelm commented Apr 12, 2023

That was very quick, thanks!

Created a fix and included the test code provided by @marcelm. The resulting code produces slightly inferior results in terms of adapter matching but this is offset by the hugely better runtime characteristics for very long adapters.

But false positives for the heuristic just mean that the full algorithm is run more often, right? So the results shouldn’t change.

@rhpvorderman
Copy link
Collaborator

Exactly, the result doesn't change. It just takes very slightly longer to get there.

@marcelm
Copy link
Owner

marcelm commented Apr 12, 2023

This issue motivated me sufficiently to finally implement proper poly-A trimming, so I just added option --poly-a. This is in my tests more accurate and about eight times (!) faster than using -a A{100}. I plan to release Cutadapt 4.4 soon with this new feature (and other fixes).

At the moment, --poly-a is run after adapter trimming, which appeared to make the most sense IMO, but please let me know if you think a different order would be good.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants