Tokenization for Hindi (e.g. क्या) is weird #42

Open
alvations opened this issue Mar 28, 2019 · 6 comments
Labels
bug Something isn't working

Comments

@alvations
Contributor

>>> from sacremoses import MosesTokenizer
>>> mt = MosesTokenizer()
>>> mt.tokenize('क्या')
['क', '्', 'या']
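
For reference, क्या is four code points, and the split pieces line up with their Unicode categories (a quick standard-library check, not sacremoses code):

>>> import unicodedata
>>> for ch in 'क्या':
...     print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")
...
U+0915 Lo DEVANAGARI LETTER KA
U+094D Mn DEVANAGARI SIGN VIRAMA
U+092F Lo DEVANAGARI LETTER YA
U+093E Mc DEVANAGARI VOWEL SIGN AA

The virama ् (U+094D, category Mn) is presumably what an alphanumeric-only character class fails to match.
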
@johnfarina

The same is true for Chinese and Korean: sacremoses splits all characters.

Here's some Chinese:

>>> mt = MosesTokenizer(lang='zh')
>>> mt.tokenize("记者 应谦 美国")
['记', '者', '应', '谦', '美', '国']

And some Korean:

>>> mt = MosesTokenizer(lang='ko')
>>> mt.tokenize("세계 에서 가장 강력한")
['세', '계', '에', '서', '가', '장', '강', '력', '한']

That's a shame, as I'd really like to use sacremoses as the tokenizer for LASER instead of calling the Moses perl scripts via subprocess and temp files, roughly like the sketch below.
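
(A sketch of that workaround, for reference; moses_perl_tokenize is a hypothetical helper name and the mosesdecoder path is just a local checkout:)

import os
import subprocess

# Pipe text through the Moses perl tokenizer and split on whitespace.
MOSES_TOKENIZER = os.path.expanduser(
    '~/mosesdecoder/scripts/tokenizer/tokenizer.perl')

def moses_perl_tokenize(text, lang='zh'):
    proc = subprocess.run(
        ['perl', MOSES_TOKENIZER, '-l', lang],
        input=text.encode('utf-8'),
        stdout=subprocess.PIPE,
        check=True,
    )
    return proc.stdout.decode('utf-8').strip().split()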

@alvations
Contributor Author

Expected behavior for zh and ko:

$ echo "记者 应谦 美国"  | ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l zh 
Tokenizer Version 1.1
Language: zh
Number of threads: 1
记者 应谦 美国

$ echo "세계 에서 가장 강력한" | ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l ko
Tokenizer Version 1.1
Language: ko
Number of threads: 1
WARNING: No known abbreviations for language 'ko', attempting fall-back to English version...
세계 에서 가장 강력한

@alvations alvations added the bug Something isn't working label Jul 16, 2019
@alvations alvations changed the title Tokenization for Hindi क्या is weird Tokenization for Hindi (e.g. क्या) and CJK is weird Jul 16, 2019
@alvations
Contributor Author

alvations commented Jul 16, 2019

Looks like the unichars list and the perluniprops list of Alphanumeric characters are a little different.

The issue comes from https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py#L420, where non-alphanumeric characters are padded with spaces.
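
To make the mechanism concrete, the padding boils down to something like this; using an ASCII-only stand-in for the alnum class shows what happens when CJK characters are missing from it (an illustration, not the actual sacremoses regex):

>>> import re
>>> pad_not_alnum = re.compile(r"([^a-zA-Z0-9\s\.\'\`\,\-])")  # too-narrow class
>>> pad_not_alnum.sub(r" \1 ", "记者 应谦 美国").split()
['记', '者', '应', '谦', '美', '国']

Every CJK character falls outside the class, gets space-padded, and ends up as its own token, exactly as reported above.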

It looks like Perl's \p{IsAlnum} includes the CJK characters:

$ echo "记者 应谦 美国" | perl -CSD -pe "s/([^\p{IsAlnum}\s\.\'\`\,\-])/ \$1 /g"
记者 应谦 美国

But when we check unichars, it's missing:

$ unichars '\p{Alnum}' | cut -f2 -d' ' | grep "记"

Using the unichars -au option works:

$ unichars -au '\p{Alnum}' | cut -f2 -d' ' | grep "记"
记
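
Double-checking from Python that 记 should indeed count as alphanumeric (standard library only):

>>> import unicodedata
>>> unicodedata.category('记')  # Lo = Letter, other
'Lo'
>>> '记'.isalnum()
True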

Note: see https://webcache.googleusercontent.com/search?q=cache:bmLqeEnWJa0J:https://codeday.me/en/qa/20190306/8531.html+&cd=6&hl=en&ct=clnk&gl=sg

@alvations alvations changed the title Tokenization for Hindi (e.g. क्या) and CJK is weird Tokenization for Hindi (e.g. क्या) is weird Jul 16, 2019
@alvations
Contributor Author

alvations commented Jul 16, 2019

@johnfarina Thanks for spotting that! The latest PR #60 should resolve the CJK issues.

The Hindi one is a little more complicated, so I'm leaving this issue open.

pip install -U "sacremoses>=0.0.22"
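
With that, the zh example above should match the Perl tokenizer's output (expected behavior, assuming PR #60 does what it says):

>>> from sacremoses import MosesTokenizer
>>> mt = MosesTokenizer(lang='zh')
>>> mt.tokenize("记者 应谦 美国")
['记者', '应谦', '美国']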

@johnfarina

Oh wow, comment on a github issue, go to bed, wake up, bug is fixed! Thanks so much @alvations !!

@mtresearcher

@alvations any update on the Hindi tokenization issue?
