Space deduplication #79

alvations · 2019-11-25T02:44:26Z

Moses default behavior is to use a regex to deduplicate spaces, https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py#L507

But in Python, this can be don't by doing " ".join(text.split())

Would it be appropriate to change the space de-duplication behavior in sacremoses to use " ".join(text.split()) ?

The text was updated successfully, but these errors were encountered:

goodmami · 2020-04-02T05:40:24Z

The first issue is whether Python's str.split() uses the same definition of a space as \s in Python regular expressions. To test this, I got took the list of unicode whitespace characters from Wikipedia and compared str.split() with re.sub():

import re

spaces = (
  '\t',      # tab
  '\n',      # newline
  '\v',      # vertical tab
  '\f',      # form feed
  '\r',      # carriage return
  '  ',      # space
  '\u0085',  # next line
  '\u00a0',  # non-breaking space
  '\u1680',  # ogham space mark
  '\u2000',  # en quad
  '\u2001',  # em quad
  '\u2002',  # en space
  '\u2003',  # em space
  '\u2004',  # three-per-em space
  '\u2005',  # four-per-em space
  '\u2006',  # six-per-em space
  '\u2007',  # figure space
  '\u2008',  # punctuation space
  '\u2009',  # thin space
  '\u200a',  # hair space
  '\u2028',  # line separator
  '\u2029',  # paragraph separator
  '\u202f',  # narrow no-break space
  '\u205f',  # medium mathematical space
  '\u3000',  # ideographic space
)

for sp in spaces:
    s = f'a{sp}b'
    assert ' '.join(s.split()) == re.sub(r'\s+', ' ', s)

I get no AssertionError when I run this with Python 3.6, so that's good news. Note that this is only the case if the regular expression is compiled in unicode mode, which is the default for Python 3. There may be some implementations or builds of Python that do not consider unicode spaces, but I am ignoring that possibility.

The next question is whether ' '.join(s.split()) results in the same output as re.sub(r'\s+', ' ', s), and this is not always true, particularly when there are leading or trailing spaces:

>>> import re
>>> s = ' a   b '
>>> ' '.join(s.split())
'a b'
>>> re.sub(r'\s+', ' ', s)
' a b '

As long as you are currently always doing s.strip() before or after the space deduplication, then you should be safe.

The final question (that I came up with) is whether doing so has any performance gains. I think in general a simple string operation is faster than invoking the regex engine, and timeit confirms it here:

$ python3 -m timeit -uusec -s 'import re; r = re.compile(r"\s+"); s = "a " * 20' '" ".join(s.split())'
500000 loops, best of 5: 0.521 usec per loop
$ python3 -m timeit -uusec -s 'import re; r = re.compile(r"\s+"); s = "a " * 20' 'r.sub(" ", s)'
100000 loops, best of 5: 3.73 usec per loop

So assuming you are currently removing leading or trailing spaces, deduplicating spaces with str.split() seems like a good idea.

alvations · 2020-04-13T04:16:33Z

Thanks @goodmami for the analysis!!

Hmmm, seems like considering things like em-space and ideographic space might be problematic because I think it affects other regexes in Moses. Have to take a closer look at other regexes that manipulates/uses these "alternative" spaces in the moses' list of regexes.

goodmami · 2020-04-13T04:31:03Z

Yes, all I have shown was that the basic string operations are suitable to replace Python's re.sub() if the leading and trailing spaces are accounted for, and I make no claims as to either way's compatibility with Moses. If you're currently using re.sub() without trouble, then I don't expect any new issues to arise, but if it differs from Moses in some way, re.sub() would be more flexible if you need special-casing.

alvations added help wanted Extra attention is needed question Further information is requested labels Nov 25, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Space deduplication #79

Space deduplication #79

alvations commented Nov 25, 2019

goodmami commented Apr 2, 2020

alvations commented Apr 13, 2020

goodmami commented Apr 13, 2020

Space deduplication #79

Space deduplication #79

Comments

alvations commented Nov 25, 2019

goodmami commented Apr 2, 2020

alvations commented Apr 13, 2020

goodmami commented Apr 13, 2020