Weird results for Tamil and Russian tokenization #73

Closed
BLKSerene opened this issue Oct 6, 2019 · 4 comments
Labels: bug (Something isn't working)

Comments

@BLKSerene
Contributor

Hi, the results for Tamil tokenization are weird:

>>> import sacremoses
>>> TEXT_TAM = 'தமிழ் மொழி (Tamil language) தமிழர்களினதும், தமிழ் பேசும் பலரதும் தாய்மொழி ஆகும்.'
>>> EXPECTED_TOKENS = ['தமிழ்', 'மொழி', '(', 'Tamil', 'language', ')', 'தமிழர்களினதும்', ',', 'தமிழ்', 'பேசும்', 'பலரதும்', 'தாய்மொழி', 'ஆகும்', '.']

>>> t = sacremoses.MosesTokenizer(lang = 'ta')
>>> t.tokenize(TEXT_TAM)
['தமிழ', '்', 'மொழி', '(', 'Tamil', 'language', ')', 'தமிழர', '்', 'களினதும', '்', ',', 'தமிழ', '்', 'பேசும', '்', 'பலரதும', '்', 'தாய', '்', 'மொழி', 'ஆகும', '்', '.']
>>> assert t.tokenize(TEXT_TAM) == EXPECTED_TOKENS
Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    assert t.tokenize(TEXT_TAM) == EXPECTED_TOKENS
AssertionError
@yannvgn
Contributor

yannvgn commented Oct 6, 2019

Just a note: the Moses tokenizer has the same behavior:

$ echo "தமிழ் மொழி (Tamil language) தமிழர்களினதும், தமிழ் பேசும் பலரதும் தாய்மொழி ஆகும்." | tokenizer.perl -q -no-escape -l ta
தமிழ ் மொழி ( Tamil language ) தமிழர ் களினதும ் , தமிழ ் பேசும ் பலரதும ் தாய ் மொழி ஆகும ் .

@alvations
Contributor

alvations commented Oct 6, 2019

Yes, this is actually the same as the Hindi problem in #42.

There's a way to resolve this, but it requires a little more digging and a better understanding of how Indian languages are handled in the Unicode charset =(

This is caused by the code at https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py#L41 which pads spaces around any character that isn't alphanumeric according to https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py#L24

Adding the character to the IsAlnum list and uncommenting the character will partially fix this issue, but there are actually more problems than just that one Unicode character.
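To illustrate the mechanism described above, here is a minimal sketch of "pad everything that isn't alphanumeric, then split". It is not the sacremoses code: `is_alnum_like` and `pad_and_split` are hypothetical helpers, and the alphanumeric test is only a rough stand-in built from Unicode general categories (letters, numbers, and spacing marks), not Perl's IsAlnum list. The `extra_categories` argument plays the role of "adding the character to the list".

```python
import unicodedata

def is_alnum_like(ch, extra_categories=()):
    """Rough stand-in for an alphanumeric test (NOT Perl's IsAlnum)."""
    cat = unicodedata.category(ch)
    # Assumption: letters (L*), numbers (N*) and spacing marks (Mc) are word characters.
    return cat[0] in "LN" or cat == "Mc" or cat in extra_categories

def pad_and_split(text, extra_categories=()):
    """Pad every non-space, non-alnum-like character with spaces, then split."""
    padded = "".join(
        ch if ch.isspace() or is_alnum_like(ch, extra_categories) else " " + ch + " "
        for ch in text
    )
    return padded.split()

tamil_word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"              # first word of the Tamil sample
russian_word = "\u0420\u0443\u0301\u0441\u0441\u043a\u0438\u0439"  # accented first Russian word

# Without nonspacing marks in the set, the virama and the accent are split off.
print(pad_and_split(tamil_word))
print(pad_and_split(russian_word))

# Treating nonspacing marks (Mn) as word characters keeps each word whole.
print(pad_and_split(tamil_word, extra_categories=("Mn",)))
print(pad_and_split(russian_word, extra_categories=("Mn",)))
```

Allowing a whole category rather than a single character is one possible design choice; whether that is appropriate for every language Moses supports is an open question and not something this sketch settles.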

@alvations added the bug label on Oct 6, 2019
@BLKSerene
Contributor Author

And the problem also exists for Russian:

>>> import sacremoses
>>> TEXT_RUS = 'Ру́сский язы́к ([ˈruskʲɪi̯ jɪˈzɨk] Информация о файле слушать)[~ 3][⇨] — один из восточнославянских языков, национальный язык русского народа.'
>>> EXPECTED_TOKENS = ['Ру́сский', 'язы́к', '(', '[', 'ˈruskʲɪi̯', 'jɪˈzɨk', ']', 'Информация', 'о', 'файле', 'слушать', ')', '[', '~', '3', ']', '[', '⇨', ']', '—', 'один', 'из', 'восточнославянских', 'языков', ',', 'национальный', 'язык', 'русского', 'народа', '.']

>>> t = sacremoses.MosesTokenizer(lang = 'ru')
>>> t.tokenize(TEXT_RUS)
['Ру', '́', 'сский', 'язы', '́', 'к', '(', '&#91;', 'ˈruskʲɪi', '̯', 'jɪˈzɨk', '&#93;', 'Информация', 'о', 'файле', 'слушать', ')', '&#91;', '~', '3', '&#93;', '&#91;', '⇨', '&#93;', '—', 'один', 'из', 'восточнославянских', 'языков', ',', 'национальный', 'язык', 'русского', 'народа', '.']
>>> assert t.tokenize(TEXT_RUS) == EXPECTED_TOKENS
Traceback (most recent call last):
  File "<pyshell#5>", line 1, in <module>
    assert t.tokenize(TEXT_RUS) == EXPECTED_TOKENS
AssertionError
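
Part of the mismatch above is separate from the accent splitting: the brackets come back as `&#91;` and `&#93;` because the tokenizer XML-escapes some characters, which is what the `-no-escape` flag in the earlier Perl command turns off. As a hedged sketch, the escaped tokens can be converted back with the standard library; `unescape_tokens` is a hypothetical helper, not part of sacremoses.

```python
import html

def unescape_tokens(tokens):
    """Turn XML-escaped tokens such as '&#91;' back into literal characters."""
    return [html.unescape(tok) for tok in tokens]

# A few tokens copied from the output above.
tokens = ["(", "&#91;", "jɪˈzɨk", "&#93;", "&#91;", "~", "3", "&#93;"]
print(unescape_tokens(tokens))  # ['(', '[', 'jɪˈzɨk', ']', '[', '~', '3', ']']
```

sacremoses' `tokenize` also appears to accept an `escape=False` argument that avoids the escaping up front; treat that as an assumption to check against the installed version. Either way, the split accents remain a separate problem.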

@alvations
Contributor

Same output from the default mosesdecoder (commit: moses-smt/mosesdecoder@0578892):

$ echo "Ру́сский язы́к ([ˈruskʲɪi̯ jɪˈzɨk] Информация о файле слушать)[~ 3][⇨] — один из восточнославянских языков, национальный язык русского народа." | perl tokenizer.perl -l ru
Tokenizer Version 1.1
Language: ru
Number of threads: 1
Ру ́ сский язы ́ к ( &#91; ˈruskʲɪi ̯ jɪˈzɨk &#93; Информация о файле слушать ) &#91; ~ 3 &#93; &#91; ⇨ &#93; — один из восточнославянских языков , национальный язык русского народа .
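
Both samples in this thread seem to break at the same kind of character. A quick way to check which characters are involved is to list the combining marks in a string with `unicodedata`. The helper name `list_marks` is hypothetical, and the fragments below are retyped with explicit escapes on the assumption that the accents in the samples are U+0301 and U+032F.

```python
import unicodedata

def list_marks(text):
    """Return (code point, category, name) for each distinct combining mark in text."""
    marks = []
    for ch in text:
        # General categories Mn, Mc and Me all start with "M".
        if unicodedata.category(ch).startswith("M") and ch not in marks:
            marks.append(ch)
    return [
        ("U+%04X" % ord(ch), unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in marks
    ]

# Fragments of the two samples, written with escapes so the marks are visible.
russian = "\u0420\u0443\u0301\u0441\u0441\u043a\u0438\u0439 ˈruskʲɪi\u032f"
tamil = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"

for row in list_marks(russian) + list_marks(tamil):
    print(row)
```

In these fragments the characters that the tokenizer outputs show as separate tokens are all nonspacing marks (category `Mn`), while the Tamil spacing vowel sign (category `Mc`) stays attached to its word.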

@alvations changed the title from "Weird results for Tamil tokenization" to "Weird results for Tamil and Russian tokenization" on Oct 7, 2019