Cutadapt 4.3 runs excruciatingly long #694
Comments
As the author of #663 I can say that this is unexpected behaviour. I did expect that some scenarios might cause slowdowns, but not this radical.
Also, what is the purpose of your cutadapt call? To me it looks like poly-A trimming. Is that correct? @marcelm I see
Hi @uniqueg, thanks a lot for the report. I was able to reproduce the problem even with an empty FASTQ file (
@uniqueg provided a test file, but as I mentioned, you can just use an empty FASTQ file; the slow part is the construction of the k-mer finder.
Using
Perhaps. I don’t remember exactly right now, but I think |
Oh dear. That makes sense. That uses a brute-force approach. For the Illumina adapter this is only a small cost, but it gets much worse as the adapter becomes longer. I think these algorithms can be rewritten. Because the k-mer finder matches multiple patterns simultaneously, there is no need to optimize for the smallest number of patterns. The only reason to do so is to minimize the number of false positives.
Thanks, great that you’re going to look into this. I got so far as to add this test, feel free to copy it:

    @pytest.mark.timeout(0.5)
    def test_create_positions_and_kmers_slow():
        create_positions_and_kmers(
            "A" * 100,
            min_overlap=3,
            error_rate=0.1,
            back_adapter=True,
            front_adapter=False,
            internal=True,
        )
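For readers who want a similar guard without a plugin: the `timeout` marker in the test above comes from the pytest-timeout plugin. A rough stand-in using only the standard library could look like the sketch below. The helper name `run_within_budget` is hypothetical and not part of Cutadapt or pytest. Unlike the plugin, it cannot interrupt a call that hangs; it only reports a budget overrun after the call returns.

```python
import time


def run_within_budget(func, budget_seconds, *args, **kwargs):
    """Call func and fail if it took longer than budget_seconds.

    Hypothetical helper for illustration only. It measures wall-clock
    time and raises AssertionError after the call has finished, so it
    does not abort a call that never returns.
    """
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    if elapsed > budget_seconds:
        raise AssertionError(
            f"{func.__name__} took {elapsed:.3f}s, budget was {budget_seconds}s"
        )
    return result, elapsed


if __name__ == "__main__":
    # A cheap call stays comfortably within a generous budget.
    value, seconds = run_within_budget(sum, 5.0, range(1000))
    print(value, f"{seconds:.6f}s")
```

In a regression test for this issue, the function under test would take the place of `sum`, with the budget chosen loosely enough to avoid flaky failures on slow CI machines.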
Thanks everyone for the feedback. Hope you can find a fix for this that still retains most/all of the advantages that the feature provides, @rhpvorderman. Use case is indeed adapter trimming. Having an explicit option may indeed be nice to have :) Also thanks to all developers for the great tool 🙏
Created a fix and included the test code provided by @marcelm.
That was very quick, thanks!
But false positives for the heuristic just mean that the full algorithm is run more often, right? So the results shouldn’t change.
Exactly, the result doesn't change. It just takes very slightly longer to get there.
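To illustrate the point about false positives, here is a minimal Python sketch of a k-mer style prefilter in front of a full matcher. It is not Cutadapt's implementation: all function names are hypothetical, only substitutions are counted as errors, and only full-length occurrences of the pattern are considered. It relies on the pigeonhole principle: if a pattern is split into `max_errors + 1` chunks and it occurs with at most `max_errors` substitutions, at least one chunk must occur exactly. The prefilter therefore never rejects a true match; a false positive only triggers an extra run of the full check.

```python
def split_into_chunks(pattern, n_chunks):
    """Split pattern into n_chunks contiguous pieces of near-equal length."""
    base, extra = divmod(len(pattern), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + base + (1 if i < extra else 0)
        chunks.append(pattern[start:end])
        start = end
    return chunks


def prefilter(chunks, text):
    """Cheap check: does any chunk occur exactly somewhere in text?"""
    return any(chunk in text for chunk in chunks)


def full_match(pattern, text, max_errors):
    """Slow check: pattern occurs in text with <= max_errors substitutions."""
    m = len(pattern)
    for i in range(len(text) - m + 1):
        mismatches = sum(a != b for a, b in zip(pattern, text[i:i + m]))
        if mismatches <= max_errors:
            return True
    return False


def match_with_heuristic(pattern, text, max_errors):
    """Run the full check only when the prefilter cannot rule the text out."""
    chunks = split_into_chunks(pattern, max_errors + 1)
    if not prefilter(chunks, text):
        return False  # safe: no chunk occurs exactly, so no match is possible
    return full_match(pattern, text, max_errors)


if __name__ == "__main__":
    pattern = "AGATCGGAAGAGC"
    reads = [
        "TTTTAGATCGGAAGTGCCC",   # true match with one substitution
        "TTAGATCGGTTTTTTTTTT",   # prefilter false positive, no real match
        "TTTTTTTTTTTTTTTTTTT",   # rejected by the prefilter
    ]
    for read in reads:
        # Both columns agree for every read; the prefilter only saves time.
        print(match_with_heuristic(pattern, read, 1), full_match(pattern, read, 1))
```

Using more or different chunks changes how often the full check runs, not what it returns, which is why a less selective set of k-mers costs only a little extra time.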
This issue motivated me sufficiently to finally implement proper poly-A trimming, so I just added option At the moment, |
Problem description
Our CI pipeline runtime recently (and suddenly) increased from ~3 min to more than 4 h. We were able to identify an internal Cutadapt call as the culprit, with each individual call taking more than 1 h. We did not observe a functional deterioration, but I am not 100% sure that our CI pipeline would necessarily pick that up.
The problem occurs only for the most recent version (4.3), not with versions 4.1 and 4.2. Indeed, capping the range of the supported Cutadapt version to <=4.2 restored our CI runtime back to the typical ~3 min.
Steps to reproduce
Here is the offending call:
where FILE is, for example, this tiny test file.
Software information
conda-forge, build h12debd9_5
bioconda, build py310h1425a21_0
Additional info
In our CI pipeline, the problem occurs for Python versions 3.7, 3.8 and 3.9 as well. Installation via Pip was not tested. When recreating the issue locally, one of my cores was running at 100% load, with very little memory consumption. I did not wait for the command to finish (my laptop got hot), so I could not check whether the output differs from that obtained with older versions (apologies!).
Looking at the changes introduced in 4.3, my best bet would be on #663 possibly being the cause of this issue.