Tokenization for Hindi (e.g. क्या) is weird #42

Open
alvations opened this issue Mar 28, 2019 · 6 comments
Labels
bug Something isn't working

Comments

@alvations
Contributor

>>> from sacremoses import MosesTokenizer
>>> mt = MosesTokenizer()
>>> mt.tokenize('क्या')
['क', '्', 'या']
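
For reference, क्या is four code points, and the split pieces line up with their Unicode categories (a quick standard-library check, not sacremoses code):

>>> import unicodedata
>>> for ch in 'क्या':
...     print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")
...
U+0915 Lo DEVANAGARI LETTER KA
U+094D Mn DEVANAGARI SIGN VIRAMA
U+092F Lo DEVANAGARI LETTER YA
U+093E Mc DEVANAGARI VOWEL SIGN AA

The virama ् (U+094D, category Mn) is presumably what an alphanumeric-only character class fails to match.
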
@johnfarina

The same is true for Chinese and Korean: sacremoses splits all characters.

Here's some Chinese:

>>> mt = MosesTokenizer(lang='zh')
>>> mt.tokenize("记者 应谦 美国")
['记', '者', '应', '谦', '美', '国']

And some Korean:

>>> mt = MosesTokenizer(lang='ko')
>>> mt.tokenize("세계 에서 가장 강력한")
['세', '계', '에', '서', '가', '장', '강', '력', '한']

That's a shame, as I'd really like to use sacremoses as the tokenizer for LASER instead of calling the Moses perl scripts via subprocess and temp files, roughly like the sketch below.
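
(A sketch of that workaround, for reference; moses_perl_tokenize is a hypothetical helper name and the mosesdecoder path is just a local checkout:)

import os
import subprocess

# Pipe text through the Moses perl tokenizer and split on whitespace.
MOSES_TOKENIZER = os.path.expanduser(
    '~/mosesdecoder/scripts/tokenizer/tokenizer.perl')

def moses_perl_tokenize(text, lang='zh'):
    proc = subprocess.run(
        ['perl', MOSES_TOKENIZER, '-l', lang],
        input=text.encode('utf-8'),
        stdout=subprocess.PIPE,
        check=True,
    )
    return proc.stdout.decode('utf-8').strip().split()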

@alvations
Contributor Author

Expected behavior for zh and ko:

$ echo "记者 应谦 美国"  | ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l zh 
Tokenizer Version 1.1
Language: zh
Number of threads: 1
记者 应谦 美国

$ echo "세계 에서 가장 강력한" | ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l ko
Tokenizer Version 1.1
Language: ko
Number of threads: 1
WARNING: No known abbreviations for language 'ko', attempting fall-back to English version...
세계 에서 가장 강력한

@alvations alvations added the bug Something isn't working label Jul 16, 2019
@alvations alvations changed the title Tokenization for Hindi क्या is weird Tokenization for Hindi (e.g. क्या) and CJK is weird Jul 16, 2019
@alvations
Contributor Author

alvations commented Jul 16, 2019

Looks like the unichars list and the perluniprops list of Alphanumeric characters are a little different.

The issue comes from https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py#L420, where non-alphanumeric characters are padded with spaces.
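
To make the mechanism concrete, the padding boils down to something like this; using an ASCII-only stand-in for the alnum class shows what happens when CJK characters are missing from it (an illustration, not the actual sacremoses regex):

>>> import re
>>> pad_not_alnum = re.compile(r"([^a-zA-Z0-9\s\.\'\`\,\-])")  # too-narrow class
>>> pad_not_alnum.sub(r" \1 ", "记者 应谦 美国").split()
['记', '者', '应', '谦', '美', '国']

Every CJK character falls outside the class, gets space-padded, and ends up as its own token, exactly as reported above.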

It looks like Perl's \p{IsAlnum} includes the CJK characters:

$ echo "记者 应谦 美国" | perl -CSD -pe "s/([^\p{IsAlnum}\s\.\'\`\,\-])/ \$1 /g"
记者 应谦 美国

But when we check unichars, it's missing:

$ unichars '\p{Alnum}' | cut -f2 -d' ' | grep "记"

Using the unichars -au option works:

$ unichars -au '\p{Alnum}' | cut -f2 -d' ' | grep "记"
记
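
Double-checking from Python that 记 should indeed count as alphanumeric (standard library only):

>>> import unicodedata
>>> unicodedata.category('记')  # Lo = Letter, other
'Lo'
>>> '记'.isalnum()
True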

Note: see https://webcache.googleusercontent.com/search?q=cache:bmLqeEnWJa0J:https://codeday.me/en/qa/20190306/8531.html+&cd=6&hl=en&ct=clnk&gl=sg

@alvations alvations changed the title Tokenization for Hindi (e.g. क्या) and CJK is weird Tokenization for Hindi (e.g. क्या) is weird Jul 16, 2019
@alvations
Contributor Author

alvations commented Jul 16, 2019

@johnfarina Thanks for spotting that! The latest PR #60 should resolve the CJK issues.

The Hindi one is a little more complicated, so I'm leaving this issue open.

pip install -U "sacremoses>=0.0.22"
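
With that, the zh example above should match the Perl tokenizer's output (expected behavior, assuming PR #60 does what it says):

>>> from sacremoses import MosesTokenizer
>>> mt = MosesTokenizer(lang='zh')
>>> mt.tokenize("记者 应谦 美国")
['记者', '应谦', '美国']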

@johnfarina

Oh wow, comment on a github issue, go to bed, wake up, bug is fixed! Thanks so much @alvations !!

@mtresearcher

@alvations any update on the Hindi tokenization issue?
