Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Space deduplication #79

Open
alvations opened this issue Nov 25, 2019 · 3 comments
Open

Space deduplication #79

alvations opened this issue Nov 25, 2019 · 3 comments
Labels
help wanted Extra attention is needed question Further information is requested

Comments

@alvations
Copy link
Contributor

Moses default behavior is to use a regex to deduplicate spaces, https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py#L507

But in Python, this can be don't by doing " ".join(text.split())

Would it be appropriate to change the space de-duplication behavior in sacremoses to use " ".join(text.split()) ?

@alvations alvations added help wanted Extra attention is needed question Further information is requested labels Nov 25, 2019
@goodmami
Copy link

goodmami commented Apr 2, 2020

The first issue is whether Python's str.split() uses the same definition of a space as \s in Python regular expressions. To test this, I got took the list of unicode whitespace characters from Wikipedia and compared str.split() with re.sub():

import re

spaces = (
  '\t',      # tab
  '\n',      # newline
  '\v',      # vertical tab
  '\f',      # form feed
  '\r',      # carriage return
  '  ',      # space
  '\u0085',  # next line
  '\u00a0',  # non-breaking space
  '\u1680',  # ogham space mark
  '\u2000',  # en quad
  '\u2001',  # em quad
  '\u2002',  # en space
  '\u2003',  # em space
  '\u2004',  # three-per-em space
  '\u2005',  # four-per-em space
  '\u2006',  # six-per-em space
  '\u2007',  # figure space
  '\u2008',  # punctuation space
  '\u2009',  # thin space
  '\u200a',  # hair space
  '\u2028',  # line separator
  '\u2029',  # paragraph separator
  '\u202f',  # narrow no-break space
  '\u205f',  # medium mathematical space
  '\u3000',  # ideographic space
)

for sp in spaces:
    s = f'a{sp}b'
    assert ' '.join(s.split()) == re.sub(r'\s+', ' ', s)

I get no AssertionError when I run this with Python 3.6, so that's good news. Note that this is only the case if the regular expression is compiled in unicode mode, which is the default for Python 3. There may be some implementations or builds of Python that do not consider unicode spaces, but I am ignoring that possibility.

The next question is whether ' '.join(s.split()) results in the same output as re.sub(r'\s+', ' ', s), and this is not always true, particularly when there are leading or trailing spaces:

>>> import re
>>> s = ' a   b '
>>> ' '.join(s.split())
'a b'
>>> re.sub(r'\s+', ' ', s)
' a b '

As long as you are currently always doing s.strip() before or after the space deduplication, then you should be safe.

The final question (that I came up with) is whether doing so has any performance gains. I think in general a simple string operation is faster than invoking the regex engine, and timeit confirms it here:

$ python3 -m timeit -uusec -s 'import re; r = re.compile(r"\s+"); s = "a " * 20' '" ".join(s.split())'
500000 loops, best of 5: 0.521 usec per loop
$ python3 -m timeit -uusec -s 'import re; r = re.compile(r"\s+"); s = "a " * 20' 'r.sub(" ", s)'
100000 loops, best of 5: 3.73 usec per loop

So assuming you are currently removing leading or trailing spaces, deduplicating spaces with str.split() seems like a good idea.

@alvations
Copy link
Contributor Author

Thanks @goodmami for the analysis!!

Hmmm, seems like considering things like em-space and ideographic space might be problematic because I think it affects other regexes in Moses. Have to take a closer look at other regexes that manipulates/uses these "alternative" spaces in the moses' list of regexes.

@goodmami
Copy link

Yes, all I have shown was that the basic string operations are suitable to replace Python's re.sub() if the leading and trailing spaces are accounted for, and I make no claims as to either way's compatibility with Moses. If you're currently using re.sub() without trouble, then I don't expect any new issues to arise, but if it differs from Moses in some way, re.sub() would be more flexible if you need special-casing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants