Support Chinese discussion #45
Hi! I've been working on a student model for traditional Chinese (not high quality, only for alignment purposes), and maybe some of my experience can be useful for you.
The data came from OPUS (WikiMatrix, CCAligned).
Some of the unknown characters that were part of a noun were being translated as 'monkey' or 'puppet', and the absence of an ASCII period at the end of a line caused repetition. This model (zh_Hant -> English) scored 20.3 BLEU on WMT19 converted to traditional with OpenCC.
I solved the character-coverage issue by training with all the traditional text converted to pinyin, using this script:

```python
from unicodedata import category as cat
from unidecode import unidecode as uni
from pypinyin import pinyin
import sys

# Tell whether a string contains punctuation
def is_punc(string):
    return any(cat(c).startswith('P') for c in string)

for line in sys.stdin:
    pyin = pinyin(line.rstrip('\n'))
    # Flatten the list and unidecode tokens that contain punctuation
    pyin = [uni(i[0]) if is_punc(i[0]) else i[0] for i in pyin]
    print(' '.join(pyin))
```

The model lost 1 point of BLEU (and the student a couple more) with this approach, but the monkeys disappeared.
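(For reference, the script reads stdin and writes stdout, so assuming it is saved as `to_pinyin.py`, a name chosen here only for illustration, a corpus can be converted with `python to_pinyin.py < corpus.zh > corpus.pinyin`.)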
Chinese poses several unique challenges not present in other language pairs. I will start this mega-issue and update it with the individual points that need to happen for these languages to be fully supported.
find_corpus.py checks for those when checking for zh.

The Chinese character detection in firefox-translations-training/pipeline/clean/tools/clean_parallel.py (line 51 in 3b3f33b) uses the regex `u'[\u4e00-\u9fff]'`, but this may be improved. This script essentially tries to identify Chinese characters and remove the spaces between them. It can probably be improved, as currently the space between English words is also lost, whereas we should write something more complicated that detects a contiguous substring of English words and leaves it alone (see the sketch below).
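A minimal sketch of that more careful de-segmentation, assuming the goal is to drop spaces only when both neighbours are CJK characters (the function name and structure here are illustrative, not the repo's code):

```python
import re

# CJK Unified Ideographs, the same range clean_parallel.py matches.
CJK = r'[\u4e00-\u9fff]'

# Remove runs of spaces only when both neighbours are CJK characters,
# so contiguous English words keep the spaces between them.
_space_between_cjk = re.compile(f'(?<={CJK}) +(?={CJK})')

def remove_cjk_spaces(line: str) -> str:
    return _space_between_cjk.sub('', line)

# remove_cjk_spaces('我 喜欢 machine translation 模型')
# -> '我喜欢 machine translation 模型'
```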
The length-ratio filter lives in firefox-translations-training/pipeline/clean/tools/clean_parallel.py (line 93 in 3b3f33b). Most papers recommend discarding lines where the ratio of English to Chinese (or Chinese to English) words is more than 1.3.
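For illustration, a minimal sketch of such a ratio filter, using jieba (already mentioned in this issue) to count Chinese words; `keep_pair` and `MAX_RATIO` are names introduced here, not the pipeline's:

```python
import jieba  # Chinese word segmentation

MAX_RATIO = 1.3  # threshold recommended by most papers, per the text above

def keep_pair(en_line: str, zh_line: str) -> bool:
    """Keep a sentence pair only if the word-count ratio is acceptable."""
    en_words = en_line.split()
    zh_words = list(jieba.cut(zh_line))
    if not en_words or not zh_words:
        return False
    ratio = max(len(en_words) / len(zh_words),
                len(zh_words) / len(en_words))
    return ratio <= MAX_RATIO
```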
Afterwards the text should be de-segmented again and prepared for training (this step is already integrated in the script pasted above).
All of these steps except 2) apply to Japanese as well; a Japanese tokenizer should be used in place of jieba for Japanese.
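As a sketch of that swap, using fugashi (a common MeCab wrapper; the issue does not prescribe a specific Japanese tokenizer, so this library choice is an assumption):

```python
from fugashi import Tagger  # assumption: fugashi/MeCab, not named in the issue

tagger = Tagger()

def segment_ja(line: str) -> str:
    # Join MeCab word surfaces with spaces, mirroring jieba-style output.
    return ' '.join(word.surface for word in tagger(line))
```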