compile regex objects ahead of time for improved perf. #133

Merged: 2 commits, Sep 14, 2023

erip commented Jul 26, 2022

Compiles regexes where appropriate to improve performance of common operations (subs, searches, matches, finditers). Timeit results for a microbenchmark are below. MT1 is the original without compilation; MT2 is the new version with compilation, kept as a separate module only for this comparison (this PR replaces the original implementation).

In [1]: lines = [line.strip() for line in open('big.txt') if line.strip()][:1000]

In [2]: from sacremoses.tokenize import MosesTokenizer as MT1

In [3]: from sacremoses.tokenize2 import MosesTokenizer as MT2

In [4]: mt1, mt2 = MT1(lang='en'), MT2(lang='en')

In [5]: %timeit [mt1.tokenize(line) for line in lines]
714 ms ± 18.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit [mt2.tokenize(line) for line in lines]
658 ms ± 11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
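For readers unfamiliar with the pattern, here is a minimal sketch of what "compiling ahead of time" means. The rules and function names below are hypothetical and chosen only for illustration; they are not the patterns or API that sacremoses uses.

```python
import re

# Hypothetical substitution rules for illustration; not sacremoses' actual patterns.
RAW_RULES = [
    (r"\s+", " "),           # collapse runs of whitespace
    (r"([,.!?])", r" \1 "),  # pad basic punctuation with spaces
]

# Compile once at import time so every call reuses the same pattern objects.
COMPILED_RULES = [(re.compile(pattern), repl) for pattern, repl in RAW_RULES]


def normalize_uncompiled(text):
    """Apply the rules through module-level re.sub, passing pattern strings on each call."""
    for pattern, repl in RAW_RULES:
        text = re.sub(pattern, repl, text)
    return text.strip()


def normalize_compiled(text):
    """Apply the same rules through the precompiled pattern objects."""
    for pattern, repl in COMPILED_RULES:
        text = pattern.sub(repl, text)
    return text.strip()
```

Both functions return the same result. The re module already caches recently used patterns, so the gain from precompiling comes mostly from skipping that cache lookup on every call, which fits the modest difference in the timings above (714 ms vs. 658 ms, roughly 8%).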

erip commented Jul 26, 2022

As a quick note: if I replace import re with import regex as re, the timeit microbenchmark takes 1.62 s ± 117 ms per loop (mean ± std. dev. of 7 runs, 1 loop each). That is quite a penalty just for switching the regex engine!
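A comparison like this could be reproduced in isolation with a small harness along the following lines. This is only a sketch: the pattern, sample text, and time_engine helper are made up for illustration, and the third-party regex package is treated as optional, so the script still runs when it is not installed.

```python
import re
import timeit

try:
    import regex  # third-party package; may not be installed
except ImportError:
    regex = None

PATTERN = r"\s+"  # illustrative pattern, not taken from sacremoses
SAMPLE = "compile   regex objects\tahead of   time " * 50


def time_engine(engine, repeats=200):
    """Return the seconds taken to run a precompiled substitution `repeats` times."""
    compiled = engine.compile(PATTERN)
    return timeit.timeit(lambda: compiled.sub(" ", SAMPLE), number=repeats)


# Always time the standard library; add regex only when it is available.
engines = {"re": re}
if regex is not None:
    engines["regex"] = regex

timings = {name: time_engine(engine) for name, engine in engines.items()}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

Results will vary with the patterns and inputs used, so a single-pattern measurement like this one is not a substitute for the tokenizer benchmark above.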

@jelmervdl jelmervdl merged commit eacfa95 into hplt-project:master Sep 14, 2023
@erip erip deleted the feature/compiled-regex branch September 14, 2023 17:47