Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed improvements and support for Python 3.10 #39

Merged
merged 6 commits into from
Oct 14, 2021
Merged

Speed improvements and support for Python 3.10 #39

merged 6 commits into from
Oct 14, 2021

Conversation

adbar
Copy link
Contributor

@adbar adbar commented Oct 12, 2021

  • use lru_cache from functools to minimize stoplist processing time
  • use sets instead of lists for string search
  • take regex compilation out of loop
  • support for Python 3.10 in tests and setup

Copy link
Owner

@miso-belica miso-belica left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes. I added some notes. Please take a look. Also, it seems Python 3.10 has a problem with the old pytest. Maybe it's finally time to drop Python 2.7. But I will deal with it later to decide if it can be fixed somehow even with v2.7 support.

.github/workflows/run-tests.yml Outdated Show resolved Hide resolved
justext/core.py Outdated Show resolved Hide resolved
justext/core.py Outdated Show resolved Hide resolved
@adbar
Copy link
Contributor Author

adbar commented Oct 13, 2021

Hi @miso-belica, thanks for the feedback, I implemented the changes you requested.

@adbar
Copy link
Contributor Author

adbar commented Oct 13, 2021

Another question: some changes were made by the Lexical Computing team, did you think of including them in this version of jusText?
https://corpus.tools/wiki/Justext_changelog

Copy link
Owner

@miso-belica miso-belica left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @adbar. I will check also the changes made by the team.

I am going to fix CI in the main, but in the future also for other projects, it is great to allow maintainers to edit your PR. Just a tip ;)

@miso-belica miso-belica merged commit 31935c2 into miso-belica:main Oct 14, 2021
@adbar
Copy link
Contributor Author

adbar commented Oct 14, 2021

@miso-belica sorry, my bad, I left it unchecked.

@miso-belica
Copy link
Owner

No worries but while I have your attention. Do you know the URL of the git repo for https://corpus.tools/wiki/Justext_changelog? It's hard to find changes without comparable git history.

@adbar
Copy link
Contributor Author

adbar commented Oct 14, 2021

No, I don't know if it's publicly available, I just found this commit mentioned in the changelog:

@adbar
Copy link
Contributor Author

adbar commented Oct 20, 2021

@miso-belica All changes combined lead to a 2.5x speedup on my data!

The commit mentioned above didn't get strictly translated or at least it has unexpected effects: slightly more recall and less precision. So I don't know what to think, maybe we should revert it, amend it or investigate further.

Anyway, a complete test suite would also be nice to detect such changes...

@miso-belica
Copy link
Owner

Wow, didn't expect such a speedup on those micro-optimizations. Good work 👏

The commit you mentioned really just fixes the bug in the parsing data, so I would let it be. It also has a test that would fail otherwise. If the NLP score is a bit different maybe it's a task to improve the algorithm/implementation somewhere but I don't believe we should add that bug back just to have a better score. The score would be just synthetic probably, meaning it's a coincidence for some specific data.

@adbar
Copy link
Contributor Author

adbar commented Oct 20, 2021

Thanks! OK, let's keep it that way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants