first call to MosesTokenizer.tokenize is very slow #61
Comments
With English, it actually seems to hang forever (I think it's getting hung compiling regular expressions somewhere).
|
The first behavior, slowness the first time the tokenizer is used, seems reasonable. The regexes are compiled and cached, and with the new expanded perluniprop files they're huge, so it makes sense. The second behavior, English hanging, shouldn't be the case: [in]:
[out]:
Might be some cache in the perluniprop files or some system problems. Which OS are you using? Which Python version? |
But it does look like the new version of the full perluniprop feels slower =( |
I'm on MacOS 10.14.1, python 3.7.1 (via anaconda). I'll run some more tests on my end with English later today on some different OSs and python installs too to see if I can isolate the problem. |
I'm having the same issue, Ubuntu 18.04 and Python 3.7.3 |
It works with python 3.6.8 |
Same issue here. Seems it's something wrong with re on Python 3.7 |
@myleott @johnfarina could you try upgrading sacremoses? The current version should be
It's probably because of the
It was too much of a performance cost for perfect accuracy on all possible characters, so the new version falls back to the only
P/S: Weird that the PR auto-closes the issue.... |
Substantial improvement for Korean with version
English is slower, weirdly:
and Chinese takes almost 2 minutes on my machine, which is still a bit painful:
|
@johnfarina Which Python version are you using for the above benchmark? |
I have the same issue. It looks like it is indeed related to Python 3.7 🤔: With Python 3.6.1 (Amazon Linux), sacremoses 0.0.24:
With Python 3.7.3 (Amazon Linux), sacremoses 0.0.24:
|
This was python 3.7.3 (via anaconda) on Mac OS 10.14.1. I tried the same with 3.7.1 on Mac and Ubuntu 16.04 too with similar results. |
After doing some profiling with cProfile, the issue is indeed caused by a regression on Python >= 3.7, more precisely by
I've created a PR on the Python repo which fixes the issue we have here. See: python/cpython#15030
I came up with a very dirty quick fix (to be run before importing sacremoses, only on Python >= 3.7):
import sre_parse
sre_parse._uniq = lambda x: list(dict.fromkeys(x))
|
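For anyone wondering why swapping in `dict.fromkeys` helps, here is a rough, self-contained sketch. `uniq_quadratic` is a hypothetical stand-in for a list-scan de-duplication (not a copy of the CPython code), and `uniq_linear` mirrors the lambda in the quick fix above. Both keep the first occurrence of each item in order; the difference is only in how the membership check scales with very large character sets.

```python
def uniq_quadratic(items):
    """Order-preserving de-duplication using a list scan.

    Hypothetical illustration: `item not in seen` walks the whole list,
    so the total cost grows roughly with the square of the input size.
    """
    seen = []
    for item in items:
        if item not in seen:
            seen.append(item)
    return seen


def uniq_linear(items):
    """Order-preserving de-duplication via dict keys (same idea as the quick fix).

    Dicts keep insertion order, and key lookup is a hash lookup,
    so this stays close to linear in the input size.
    """
    return list(dict.fromkeys(items))


sample = [3, 1, 3, 2, 1]
assert uniq_quadratic(sample) == uniq_linear(sample)
```

The two functions return the same list, so the replacement should not change what the regex parser sees, only how long the de-duplication takes on huge inputs.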
Thanks @yannvgn!! Great to see this resolved! |
The issue has been resolved upstream. https://github.com/alvations/sacremoses/issues/61 Test run time on Circle CI: about 0.4 seconds.
As in, it takes several minutes. Seems to happen independent of the specified lang. Subsequent calls perform as expected:
Latest version of sacremoses (0.0.22). Is this a problem for anyone else?
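A minimal sketch for comparing the first call against later ones, in case it helps others reproduce this. The helper names `time_calls` and `report_sacremoses_timings` are made up for illustration; the only sacremoses calls used are `MosesTokenizer(lang=...)` and `tokenize(...)`. Actual timings will depend on the Python and sacremoses versions, so no expected numbers are given.

```python
import time


def time_calls(fn, arg, repeats=3):
    """Return the wall-clock duration in seconds of each consecutive fn(arg) call."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        durations.append(time.perf_counter() - start)
    return durations


def report_sacremoses_timings(lang="en", text="Hello World!"):
    """Print per-call timings for MosesTokenizer.tokenize (requires sacremoses)."""
    from sacremoses import MosesTokenizer  # imported lazily; third-party package

    tokenizer = MosesTokenizer(lang=lang)
    for index, seconds in enumerate(time_calls(tokenizer.tokenize, text), start=1):
        print(f"call {index}: {seconds:.4f} s")
```

Calling `report_sacremoses_timings("en")` should make the gap visible if the first call is the one paying for regex compilation.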