first call to MosesTokenizer.tokenize is very slow #61

Closed · johnfarina opened this issue Jul 16, 2019 · 14 comments · Fixed by #62
Labels: bug, fixed

Comments

@johnfarina

As in, it takes several minutes. This seems to happen independent of the specified lang.

In [1]: from sacremoses import MosesTokenizer

In [2]: mt = MosesTokenizer(lang='ko')

In [3]: %time mt.tokenize("세계 에서 가장 강력한")
CPU times: user 3min 3s, sys: 1.75 s, total: 3min 5s
Wall time: 3min 11s
Out[3]: ['세계', '에서', '가장', '강력한']

Subsequent calls perform as expected:

In [4]: %time mt.tokenize("세계 에서 가장 강력한")
CPU times: user 819 µs, sys: 0 ns, total: 819 µs
Wall time: 823 µs
Out[4]: ['세계', '에서', '가장', '강력한']

This is with the latest version of sacremoses (0.0.22). Is this a problem for anyone else?

@johnfarina
Author

With English, it actually seems to hang forever (I Ctrl-c'd the process after waiting for 15 minutes).

I think it's getting hung compiling regular expressions somewhere.

In [2]: mt = MosesTokenizer(lang='en')

In [3]: mt.tokenize("Hello, Mr. Smith!")

@alvations
Contributor

The first behavior, a slow first use of the tokenizer, seems reasonable. The regexes are compiled and cached on first use, and in the case of the new expanded perluniprop files they're huge, so a one-time cost makes sense.
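
A rough sketch of that effect (illustrative only; the character list below is a hypothetical stand-in for the perluniprop data, not sacremoses' actual patterns): compiling a pattern with a huge character class is expensive once, and reusing the cached pattern is cheap.

import re
import time

# Hypothetical stand-in for the expanded IsAlnum list: every alphanumeric
# code point in the Basic Multilingual Plane (tens of thousands of chars).
alnum_chars = "".join(chr(cp) for cp in range(0x10000) if chr(cp).isalnum())

t0 = time.time()
pattern = re.compile("[" + re.escape(alnum_chars) + "]+")
print("first compile: %.2fs" % (time.time() - t0))  # the one-time cost

t0 = time.time()
print(pattern.findall("Hello, Mr. Smith!"))  # reusing the cached pattern
print("reuse: %.6fs" % (time.time() - t0))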

The second behavior, English hanging, shouldn't be the case:

[in]:

%time 
mt.tokenize("Hello, Mr. Smith!")

[out]:

CPU times: user 3 µs, sys: 1e+03 ns, total: 4 µs
Wall time: 7.15 µs

Might be some caching issue with the perluniprop files or some system problem.

Which OS are you using? Which Python version?

@alvations
Contributor

But it does look like the new version with the full perluniprop files is slower =(
I'll run some benchmarks.

@johnfarina
Author

I'm on macOS 10.14.1, Python 3.7.1 (via Anaconda). I'll run some more tests on my end with English later today, on some different OSs and Python installs, to see if I can isolate the problem.

@333aleix333

I'm having the same issue, on Ubuntu 18.04 and Python 3.7.3.

@333aleix333

It works with Python 3.6.8.

@myleott
Contributor

myleott commented Jul 29, 2019

Same issue here. It seems something is wrong with re on Python 3.7.

@alvations alvations added the bug Something isn't working label Jul 29, 2019
@alvations alvations reopened this Jul 29, 2019
@alvations
Contributor

alvations commented Jul 29, 2019

@myleott @johnfarina could you try upgrading sacremoses? The current version should be 0.0.24.

The reason behind the slowness doesn't seem to be the Python distribution. If anything, upgrading Python should speed up regexes: https://docs.python.org/3/whatsnew/3.7.html (esp. the changes to regex compilation with flags).

It's probably because of the unichars -au inclusion of unnamed characters from Perluniprops, done to resolve the CJK tokenization issues from https://github.com/alvations/sacremoses/issues/42. That caused the IsAlpha character list to grow from 21674 to 476052 bytes, and IsAlnum from 22414 to 478372.

That was too much of a performance cost for perfect accuracy on all possible characters, so the new version falls back to the plain unichars output (without -au) and statically adds the CJK characters as needed, instead of including the whole universe of alphanumeric characters all the time.
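
For illustration, the fallback amounts to something like this (a sketch of the approach; the names and ranges here are chosen for the example, not taken from the actual sacremoses source):

import re

IS_ALNUM_BASE = "a-zA-Z0-9"  # placeholder for the plain unichars output
CJK_RANGES = (
    "\u4e00-\u9fff"  # CJK Unified Ideographs
    "\u3040-\u309f"  # Hiragana
    "\u30a0-\u30ff"  # Katakana
    "\uac00-\ud7af"  # Hangul Syllables
)
IS_ALNUM = re.compile("[" + IS_ALNUM_BASE + CJK_RANGES + "]+")
print(IS_ALNUM.findall("Hello 세계 记者"))  # ['Hello', '세계', '记者']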


P/S: Weird that the PR auto-closes the issue....

@johnfarina
Author

Substantial improvement for Korean with version 0.0.24!

In [1]: from sacremoses import MosesTokenizer
In [2]: mt = MosesTokenizer(lang="ko")
In [3]: %time mt.tokenize("세계 에서 가장 강력한")
CPU times: user 5.84 s, sys: 41.7 ms, total: 5.88 s
Wall time: 6.04 s
Out[3]: ['세계', '에서', '가장', '강력한']

English is slower, weirdly:

In [1]: from sacremoses import MosesTokenizer
In [2]: mt = MosesTokenizer(lang="en")
In [3]: %time mt.tokenize("Hello, World!")
CPU times: user 11.6 s, sys: 89.9 ms, total: 11.7 s
Wall time: 11.9 s
Out[3]: ['Hello', ',', 'World', '!']

and Chinese takes almost 2 minutes on my machine, which is still a bit painful:

In [1]: from sacremoses import MosesTokenizer
In [2]: mt = MosesTokenizer(lang="zh")
In [3]: %time mt.tokenize("记者 应谦 美国")
CPU times: user 1min 54s, sys: 878 ms, total: 1min 55s
Wall time: 1min 56s
Out[3]: ['记者', '应谦', '美国']

@alvations
Contributor

@johnfarina Which Python version are you using for the above benchmark?

@yannvgn
Contributor

yannvgn commented Jul 30, 2019

I have the same issue. It looks like it is indeed related to Python 3.7 🤔:

With Python 3.6.1 (Amazon Linux), sacremoses 0.0.24:

In [1]: from sacremoses import MosesTokenizer
In [2]: mt = MosesTokenizer(lang="en")
In [3]: %time mt.tokenize("Hello, World!")
CPU times: user 220 ms, sys: 0 ns, total: 220 ms
Wall time: 220 ms
Out[3]: ['Hello', ',', 'World', '!']

With Python 3.7.3 (Amazon Linux), sacremoses 0.0.24:

In [1]: from sacremoses import MosesTokenizer
In [2]: mt = MosesTokenizer(lang="en")
In [3]: %time mt.tokenize("Hello, World!")
CPU times: user 21.1 s, sys: 10 ms, total: 21.1 s
Wall time: 21.1 s
Out[3]: ['Hello', ',', 'World', '!']

@johnfarina
Author

> @johnfarina Which Python version are you using for the above benchmark?

This was Python 3.7.3 (via Anaconda) on macOS 10.14.1. I tried the same with 3.7.1 on Mac and on Ubuntu 16.04 too, with similar results.

@yannvgn
Contributor

yannvgn commented Jul 30, 2019

After doing some profiling with cProfile, I found that the issue is indeed caused by a regression in Python >= 3.7, more precisely in the sre_parse._uniq function, which did not exist on Python <= 3.6.
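
For reference, the profiling can be reproduced with something like this (on an affected interpreter, the sre_parse internals dominate the cumulative times):

import cProfile
from sacremoses import MosesTokenizer

mt = MosesTokenizer(lang="en")
# Sort by cumulative time to surface where the first call spends its time.
cProfile.run('mt.tokenize("Hello, World!")', sort="cumtime")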

I've created a PR on the CPython repo which fixes the issue we have here. See python/cpython#15030.

I came up with a very dirty quick fix (to be run before importing sacremoses, and only on Python >= 3.7):

import sre_parse

# Replace the quadratic duplicate-removal helper with an O(n),
# order-preserving version (the same approach as the upstream fix).
sre_parse._uniq = lambda x: list(dict.fromkeys(x))
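
For example, applied with a version guard (the guard is my addition; the regression only affects 3.7+ interpreters without the upstream fix):

import sys

if sys.version_info >= (3, 7):
    import sre_parse
    sre_parse._uniq = lambda x: list(dict.fromkeys(x))

from sacremoses import MosesTokenizer

mt = MosesTokenizer(lang="en")
print(mt.tokenize("Hello, World!"))  # first call no longer takes ~20 s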

@alvations
Contributor

Thanks @yannvgn!! Great to see this resolved!

alvations added a commit that referenced this issue Apr 14, 2020
mthrok referenced this issue in pytorch/text May 18, 2020:

> The issue has been resolved on upstream. https://github.com/alvations/sacremoses/issues/61
> Test run time on Circle CI: ~= 0.4 second.