Compile regexp in detokenizer #143

Merged
jelmervdl merged 2 commits into master from regex-optim-alt
Sep 27, 2023

jelmervdl
Collaborator

Together with #133 this replaces #140.

main: cat big.txt | python -m sacremoses -l en detokenize > /dev/null
  Time (mean ± σ):     35.786 s ±  0.612 s    [User: 35.058 s, System: 0.475 s]
  Range (min … max):   34.669 s … 36.835 s    10 runs

this: cat big.txt | python -m sacremoses -l en detokenize > /dev/null
  Time (mean ± σ):      8.581 s ±  0.119 s    [User: 8.181 s, System: 0.383 s]
  Range (min … max):    8.453 s …  8.789 s    10 runs
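
For readers unfamiliar with the technique named in the title, here is a minimal sketch of compiling regular expressions once and reusing them, instead of passing pattern strings to `re.sub` on every call. The rules and function names below are hypothetical and written only for this illustration; they are not the rules or the code sacremoses uses, and the sketch makes no claim about where exactly the speedup in this PR comes from.

```python
import re

# Hypothetical substitution rules, made up for this sketch only.
# They are NOT the detokenization rules sacremoses applies.
RULES = [
    (r"\s+([.,!?;:])", r"\1"),  # drop whitespace before punctuation
    (r"\(\s+", "("),            # drop whitespace after an opening bracket
    (r"\s+\)", ")"),            # drop whitespace before a closing bracket
]

# Compile every pattern once, at import time, and keep the compiled objects.
COMPILED_RULES = [(re.compile(pattern), repl) for pattern, repl in RULES]


def detok_uncompiled(line):
    """Apply the rules by handing pattern strings to re.sub on every call."""
    for pattern, repl in RULES:
        line = re.sub(pattern, repl, line)
    return line


def detok_compiled(line):
    """Apply the same rules through the precompiled pattern objects."""
    for regex, repl in COMPILED_RULES:
        line = regex.sub(repl, line)
    return line
```

Both functions return the same text. Python's `re` module keeps an internal cache of recently compiled patterns, so the gain from precompiling depends on how many patterns are in play and how often they are applied; measuring on real input, as done above, is the only reliable check.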

@ZJaume
Collaborator

ZJaume commented Sep 14, 2023

I don't want to be picky, but does that big.txt contain tokenized sentences? Performance may differ if the input is not tokenized.

@jelmervdl
Collaborator Author

True, I assumed it wouldn't matter much for a performance comparison. I've now run the same benchmark on a tokenized version of big.txt. The difference is slightly smaller, but still big enough to justify this change, I'd say.

main: cat big.tok.txt | python -m sacremoses -l en detokenize > /dev/null
  Time (mean ± σ):     34.814 s ±  0.724 s    [User: 34.226 s, System: 0.464 s]
  Range (min … max):   33.846 s … 36.157 s    10 runs

this: cat big.tok.txt | python -m sacremoses -l en detokenize > /dev/null
  Time (mean ± σ):      9.253 s ±  0.172 s    [User: 8.828 s, System: 0.381 s]
  Range (min … max):    9.060 s …  9.560 s    10 runs
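
To put a number on "slightly smaller", the speedup can be computed from the mean times reported in this thread. The snippet below is only a worked calculation on those four means; the dictionary layout and the `speedup` helper are made up for this note.

```python
# Mean wall-clock times in seconds, copied from the benchmark output in this thread:
# (main branch, this PR)
runs = {
    "big.txt": (35.786, 8.581),
    "big.tok.txt": (34.814, 9.253),
}


def speedup(before, after):
    """Return how many times faster the 'after' run is than the 'before' run."""
    return before / after


for name, (main_s, this_s) in runs.items():
    print(f"{name}: {speedup(main_s, this_s):.2f}x faster")
```

This gives roughly 4.17x on the untokenized input and 3.76x on the tokenized input.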

@jelmervdl jelmervdl merged commit 303ae7f into master Sep 27, 2023
15 checks passed
@jelmervdl jelmervdl deleted the regex-optim-alt branch September 27, 2023 12:50