Speed improvements and support for Python 3.10 #39

adbar · 2021-10-12T16:14:01Z

use lru_cache from functools to minimize stoplist processing time
use sets instead of lists for string search
take regex compilation out of loop
support for Python 3.10 in tests and setup
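
The first three items can be illustrated with a minimal sketch. Every name below (`PARAGRAPH_TAGS` contents, `MULTIPLE_WHITESPACE`, `load_stoplist`, `stopword_density`) is hypothetical and chosen for illustration; this is not the code from `justext/core.py`. Note that `functools.lru_cache` only exists on Python 3.2 and later.

```python
import re
from functools import lru_cache

# Hypothetical constants, not copied from justext/core.py.
# A frozenset gives constant-time membership tests, unlike a list.
PARAGRAPH_TAGS = frozenset({"p", "div", "li", "td", "h1", "h2"})

# Compiled once at import time instead of inside a per-paragraph loop.
MULTIPLE_WHITESPACE = re.compile(r"\s+")


@lru_cache(maxsize=32)
def load_stoplist(words):
    """Normalize a tuple of stop words into a frozenset.

    The argument must be hashable (hence a tuple), so that lru_cache
    can return the stored result on repeated calls with the same input.
    """
    return frozenset(word.lower() for word in words)


def stopword_density(text, words):
    """Return the fraction of tokens in `text` that are stop words."""
    stoplist = load_stoplist(words)  # cached after the first call
    tokens = MULTIPLE_WHITESPACE.sub(" ", text).strip().split(" ")
    if tokens == [""]:
        return 0.0
    hits = sum(1 for token in tokens if token.lower() in stoplist)
    return hits / len(tokens)
```

Calling `stopword_density` repeatedly with the same stop word tuple rebuilds the set only once; `load_stoplist.cache_info()` reports the cache hits.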

miso-belica

Thanks for the changes. I added some notes, please take a look. Also, it seems Python 3.10 has a problem with the old pytest. Maybe it's finally time to drop Python 2.7, but I will look into it later and decide whether it can be fixed while keeping 2.7 support.

adbar · 2021-10-13T16:22:31Z

Hi @miso-belica, thanks for the feedback, I implemented the changes you requested.

adbar · 2021-10-13T16:25:42Z

Another question: the Lexical Computing team made some changes of their own. Have you considered including them in this version of jusText?
https://corpus.tools/wiki/Justext_changelog

miso-belica

Thank you @adbar. I will also check the changes made by the team.

I am going to fix CI on main. For the future, and for other projects too, it helps to allow maintainers to edit your PR. Just a tip ;)

adbar · 2021-10-14T15:42:02Z

@miso-belica sorry, my bad, I left it unchecked.

miso-belica · 2021-10-14T15:46:40Z

No worries. While I have your attention: do you know the URL of the git repo for https://corpus.tools/wiki/Justext_changelog? It's hard to find the changes without a git history to compare against.

adbar · 2021-10-14T16:31:28Z

No, I don't know whether it's publicly available. I just found this commit mentioned in the changelog:

https://github.com/msklvsk/justext/commits/master

adbar · 2021-10-20T12:57:42Z

@miso-belica All changes combined lead to a 2.5x speedup on my data!

The commit mentioned above didn't get strictly translated, or at least it has unexpected effects: slightly higher recall and lower precision. So I don't know what to think: maybe we should revert it, amend it, or investigate further.

Anyway, a complete test suite would also be nice to detect such changes...

miso-belica · 2021-10-20T14:31:20Z

Wow, I didn't expect such a speedup from those micro-optimizations. Good work 👏

The commit you mentioned really just fixes a bug in parsing the data, so I would leave it as is. It also has a test that would fail otherwise. If the NLP score is a bit different, maybe there is something to improve in the algorithm or implementation, but I don't believe we should add that bug back just to get a better score. That better score would probably be synthetic, meaning a coincidence for some specific data.

adbar · 2021-10-20T15:59:50Z

Thanks! OK, let's keep it that way.

adbar added 3 commits October 12, 2021 18:11

core.py: code speed improvements

5957250

add support for Python 3.10

b49ead4

fix Github tests for 3.10

5a7710e

miso-belica requested changes Oct 13, 2021

.github/workflows/run-tests.yml (outdated, resolved)

justext/core.py (outdated, resolved)

justext/core.py (outdated, resolved)

adbar added 3 commits October 13, 2021 18:19

github actions: 3.10.0 → 3.10

7eef289

PARAGRAPH_TAGS as frozenset()

e26c782

use plain search instead of regex SELECT_PATTERN

62e966e

miso-belica approved these changes Oct 14, 2021

miso-belica merged commit 31935c2 into miso-belica:main Oct 14, 2021
