Weird results for Tamil and Russian tokenization #73

BLKSerene · 2019-10-06T04:39:40Z

Hi, the results for Tamil tokenization are weird:

>>> import sacremoses
>>> TEXT_TAM = 'தமிழ் மொழி (Tamil language) தமிழர்களினதும், தமிழ் பேசும் பலரதும் தாய்மொழி ஆகும்.'
>>> EXPECTED_TOKENS = ['தமிழ்', 'மொழி', '(', 'Tamil', 'language', ')', 'தமிழர்களினதும்', ',', 'தமிழ்', 'பேசும்', 'பலரதும்', 'தாய்மொழி', 'ஆகும்', '.']

>>> t = sacremoses.MosesTokenizer(lang = 'ta')
>>> t.tokenize(TEXT_TAM)
['தமிழ', '்', 'மொழி', '(', 'Tamil', 'language', ')', 'தமிழர', '்', 'களினதும', '்', ',', 'தமிழ', '்', 'பேசும', '்', 'பலரதும', '்', 'தாய', '்', 'மொழி', 'ஆகும', '்', '.']
>>> assert t.tokenize(TEXT_TAM) == EXPECTED_TOKENS
Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    assert t.tokenize(TEXT_TAM) == EXPECTED_TOKENS
AssertionError

yannvgn · 2019-10-06T07:47:58Z

Just a note: Moses tokenizer has the same behavior:

$ echo "தமிழ் மொழி (Tamil language) தமிழர்களினதும், தமிழ் பேசும் பலரதும் தாய்மொழி ஆகும்." | tokenizer.perl -q -no-escape -l ta
தமிழ ் மொழி ( Tamil language ) தமிழர ் களினதும ் , தமிழ ் பேசும ் பலரதும ் தாய ் மொழி ஆகும ் .

alvations · 2019-10-06T09:11:23Z

Yes actually this is the same as the Hindi problem at #42

There's a way to resolve this, but it requires a little more digging and a better understanding of how Indian languages are encoded in Unicode =(

This is caused by https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py#L41 padding spaces around every character that isn't in the alphanumeric set defined at https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py#L24

Adding the ் character to the IsAlnum list and uncommenting the character will fix this issue partially, but there are actually more problems than that one Unicode character.
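To make the mechanism above concrete, here is a minimal sketch using only the standard library. It is not the sacremoses code: `is_word_char` and `pad_and_split` are hypothetical helpers, and the category test inside `is_word_char` is an assumed stand-in for the real IsAlnum list. It only illustrates how padding every "non-alphanumeric" character detaches the Tamil virama, and how whitelisting that one character keeps the word whole.

```python
import unicodedata

VIRAMA = "\u0bcd"  # TAMIL SIGN VIRAMA, the character split off in the output above
TAMIL_WORD = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"  # the word 'தமிழ்'


def is_word_char(ch, extra=frozenset()):
    """Toy stand-in for an IsAlnum-style list (an assumption, not the real list)."""
    category = unicodedata.category(ch)
    # Letters, numbers and spacing marks count as word characters here;
    # non-spacing marks such as the virama do not, unless whitelisted.
    return category[0] in "LN" or category == "Mc" or ch in extra


def pad_and_split(text, extra=frozenset()):
    """Pad every non-word, non-space character with spaces, then split."""
    padded = "".join(
        ch if ch.isspace() or is_word_char(ch, extra) else f" {ch} "
        for ch in text
    )
    return padded.split()


# The virama is a non-spacing mark, so the toy predicate rejects it.
print(unicodedata.name(VIRAMA), unicodedata.category(VIRAMA))
# Without the whitelist the virama becomes its own token.
print(pad_and_split(TAMIL_WORD))
# With the virama whitelisted the word stays in one piece.
print(pad_and_split(TAMIL_WORD, extra={VIRAMA}))
```

As noted above, whitelisting a single character is only a partial fix: any other mark that the real list does not cover would still be split off in the same way.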

BLKSerene · 2019-10-07T07:58:50Z

And the problem also exists for Russian:

>>> import sacremoses
>>> TEXT_RUS = 'Ру́сский язы́к ([ˈruskʲɪi̯ jɪˈzɨk] Информация о файле слушать)[~ 3][⇨] — один из восточнославянских языков, национальный язык русского народа.'
>>> EXPECTED_TOKENS = ['Ру́сский', 'язы́к', '(', '[', 'ˈruskʲɪi̯', 'jɪˈzɨk', ']', 'Информация', 'о', 'файле', 'слушать', ')', '[', '~', '3', ']', '[', '⇨', ']', '—', 'один', 'из', 'восточнославянских', 'языков', ',', 'национальный', 'язык', 'русского', 'народа', '.']

>>> t = sacremoses.MosesTokenizer(lang = 'ru')
>>> t.tokenize(TEXT_RUS)
['Ру', '́', 'сский', 'язы', '́', 'к', '(', '&#91;', 'ˈruskʲɪi', '̯', 'jɪˈzɨk', '&#93;', 'Информация', 'о', 'файле', 'слушать', ')', '&#91;', '~', '3', '&#93;', '&#91;', '⇨', '&#93;', '—', 'один', 'из', 'восточнославянских', 'языков', ',', 'национальный', 'язык', 'русского', 'народа', '.']
>>> assert t.tokenize(TEXT_RUS) == EXPECTED_TOKENS
Traceback (most recent call last):
  File "<pyshell#5>", line 1, in <module>
    assert t.tokenize(TEXT_RUS) == EXPECTED_TOKENS
AssertionError

alvations · 2019-10-07T08:39:28Z

Same output from the default mosesdecoder (commit moses-smt/mosesdecoder@0578892):

$ echo "Ру́сский язы́к ([ˈruskʲɪi̯ jɪˈzɨk] Информация о файле слушать)[~ 3][⇨] — один из восточнославянских языков, национальный язык русского народа." | perl tokenizer.perl -l ru
Tokenizer Version 1.1
Language: ru
Number of threads: 1
Ру ́ сский язы ́ к ( &#91; ˈruskʲɪi ̯ jɪˈzɨk &#93; Информация о файле слушать ) &#91; ~ 3 &#93; &#91; ⇨ &#93; — один из восточнославянских языков , национальный язык русского народа .
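The Russian output differs from the expected tokens in two separate ways, and the sketch below pulls them apart. It uses only the standard library and a hand-picked excerpt of the tokens quoted above. The `&#91;` and `&#93;` entries are HTML escapes of the square brackets (the earlier Tamil command passed `-no-escape` to avoid them, and sacremoses' `tokenize` is understood to take a similar `escape=False` argument, which is worth checking against your installed version). The tokens that remain wrong after unescaping are stranded combining marks, the same class of problem as the Tamil virama. `is_mark_only` is a hypothetical helper.

```python
import html
import unicodedata

# Selected tokens from the Russian output quoted above (not the full list).
actual = ["Ру", "\u0301", "сский", "(", "&#91;", "~", "3", "&#93;", "&#91;", "⇨", "&#93;"]

# Undo the HTML escaping so brackets compare equal to the expected tokens.
unescaped = [html.unescape(tok) for tok in actual]


def is_mark_only(tok):
    """True if the token consists solely of combining characters."""
    return bool(tok) and all(unicodedata.combining(ch) for ch in tok)


# Tokens that are nothing but a detached combining mark.
stranded_marks = [tok for tok in unescaped if is_mark_only(tok)]

print(unescaped)
print([unicodedata.name(tok) for tok in stranded_marks])
```

Unescaping resolves the bracket mismatch on its own, while the stranded stress mark still needs a change to how combining characters are classified.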

alvations added the bug label Oct 6, 2019

alvations changed the title ~~Weird results for Tamil tokenization~~ Weird results for Tamil and Russian tokenization Oct 7, 2019

BLKSerene closed this as completed Aug 21, 2022
