-
Notifications
You must be signed in to change notification settings - Fork 78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed improvements and support for Python 3.10 #39
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes. I added some notes. Please take a look. Also, it seems Python 3.10 has a problem with the old pytest. Maybe it's finally time to drop Python 2.7. But I will deal with it later to decide if it can be fixed somehow even with v2.7 support.
Hi @miso-belica, thanks for the feedback, I implemented the changes you requested. |
Another question: some changes were made by the Lexical Computing team, did you think of including them in this version of jusText? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @adbar. I will check also the changes made by the team.
I am going to fix CI in the main, but in the future also for other projects, it is great to allow maintainers to edit your PR. Just a tip ;)
@miso-belica sorry, my bad, I left it unchecked. |
No worries but while I have your attention. Do you know the URL of the git repo for https://corpus.tools/wiki/Justext_changelog? It's hard to find changes without comparable git history. |
No, I don't know if it's publicly available, I just found this commit mentioned in the changelog: |
@miso-belica All changes combined lead to a 2.5x speedup on my data! The commit mentioned above didn't get strictly translated or at least it has unexpected effects: slightly more recall and less precision. So I don't know what to think, maybe we should revert it, amend it or investigate further. Anyway, a complete test suite would also be nice to detect such changes... |
Wow, didn't expect such a speedup on those micro-optimizations. Good work 👏 The commit you mentioned really just fixes the bug in the parsing data, so I would let it be. It also has a test that would fail otherwise. If the NLP score is a bit different maybe it's a task to improve the algorithm/implementation somewhere but I don't believe we should add that bug back just to have a better score. The score would be just synthetic probably, meaning it's a coincidence for some specific data. |
Thanks! OK, let's keep it that way. |
lru_cache
from functools to minimize stoplist processing time