Improve performance of `Similarity::calculateVectors` #66
Closed
althonos wants to merge 4 commits into
…undant checks in `similarityMatrix::getDistance`
althonos added a commit to althonos/pytrimal that referenced this pull request Jun 26, 2022
althonos added a commit to althonos/pytrimal that referenced this pull request Jun 26, 2022
althonos added a commit to althonos/pytrimal that referenced this pull request Jun 26, 2022
althonos added a commit to althonos/pytrimal that referenced this pull request Jun 27, 2022
Author: Superseded by #69.
Hi there!

While working on althonos/pytrimal I did some thorough profiling of the code, and I identified some critical sections that could be improved. In particular, I noticed that the code in `Similarity::calculateVectors` was sub-optimal, because it was repeatedly calling `similarityMatrix::getDistance` with the same sequence characters, and the check for invalid/incorrect symbols seems to have a high performance impact.

To fix this, I added two buffers to store column data: the first one for the sequence itself, storing uppercase column characters to reduce the number of `utils::toUpper` calls; the other one to store the indices of gapped/indeterminate characters. The sequence characters for a column are checked once when the column is copied; after that, the distance matrix is indexed directly, without checking character ranges.

I used `valgrind` to count cycles on a run of trimAl in strict mode on `example.073.AA.strNOG.ENOG411BFCW.fasta`; here are the results in number of cycles:

`Similarity::calculateVectors` (self) | `Similarity::calculateVectors` (incl) | `trimal` (total)
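For readers who want a concrete picture of the column-buffering idea described in this PR, here is a minimal, self-contained sketch. It is not the actual patch: `ToyMatrix`, `makeToyMatrix`, `loadColumn` and `columnDistanceSum` are hypothetical names, the three-letter alphabet and its distances are made up, and unknown symbols are simply skipped here rather than reported as errors. It only illustrates validating each column once and then indexing the distance table without per-pair checks.

```cpp
#include <array>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a similarity matrix: a letter-to-row lookup
// plus a dense distance table. Not trimAl's real data structure.
struct ToyMatrix {
    std::array<int, 26> index;              // 'A'..'Z' -> row, or -1 if unknown
    std::vector<std::vector<float>> dist;   // dist[row][row]
};

// Build a made-up 3-letter matrix (A, C, G) for illustration only.
ToyMatrix makeToyMatrix() {
    ToyMatrix m;
    m.index.fill(-1);
    m.index['A' - 'A'] = 0;
    m.index['C' - 'A'] = 1;
    m.index['G' - 'A'] = 2;
    m.dist = {{0.0f, 1.0f, 2.0f},
              {1.0f, 0.0f, 3.0f},
              {2.0f, 3.0f, 0.0f}};
    return m;
}

// Copy one alignment column into two buffers, checking each symbol once.
//   colChars[i]: uppercase character of sequence i at this column
//   skip[i]:     1 for gaps, indeterminate or unknown symbols, else 0
void loadColumn(const std::vector<std::string>& seqs, std::size_t col,
                const ToyMatrix& m, char indet,
                std::vector<char>& colChars, std::vector<char>& skip) {
    assert(colChars.size() == seqs.size() && skip.size() == seqs.size());
    for (std::size_t i = 0; i < seqs.size(); ++i) {
        unsigned char raw = static_cast<unsigned char>(seqs[i][col]);
        char c = static_cast<char>(std::toupper(raw));
        colChars[i] = c;
        bool known = c >= 'A' && c <= 'Z' && m.index[c - 'A'] >= 0;
        skip[i] = (c == '-' || c == indet || !known) ? 1 : 0;
    }
}

// Sum pairwise distances within the buffered column. Because the column
// was validated in loadColumn, the inner loop indexes the table directly.
float columnDistanceSum(const ToyMatrix& m, const std::vector<char>& colChars,
                        const std::vector<char>& skip) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < colChars.size(); ++i) {
        if (skip[i]) continue;
        int a = m.index[colChars[i] - 'A'];
        for (std::size_t j = i + 1; j < colChars.size(); ++j) {
            if (skip[j]) continue;
            sum += m.dist[a][m.index[colChars[j] - 'A']];
        }
    }
    return sum;
}
```

With a column holding `a`, `C`, `-`, `g`, `X` and `X` as the indeterminate symbol, only the A, C and G residues contribute, so the uppercase conversion and symbol check run five times in total instead of once per pair.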