
Support Chinese discussion #45

Open
Tracked by #425
XapaJIaMnu opened this issue Dec 27, 2021 · 2 comments
Labels
language-coverage Issues related to covering specific languages

Comments

@XapaJIaMnu
Contributor

XapaJIaMnu commented Dec 27, 2021

Chinese poses several unique challenges not present in other language pairs. I will start this mega-issue and update it with the individual points that need to happen for these languages to be fully supported.

  1. Language detection: some Chinese corpora are not tagged with zh but with zh_{tw,zh,hk...} etc. It would be helpful if find_corpus.py checked for those variants when checking for zh.
  2. Chinese script comes in traditional and simplified varieties, and most big translation vendors support both. Converting traditional to simplified (and vice versa) can be easily achieved through hanziconv https://pypi.org/project/hanziconv/0.3/ . There might be a very small information loss when converting simplified to traditional, but it should be fine in 99.9% of cases. Some datasets, such as the TED talks, come in traditional, so they should be converted before use (a short sketch covering this and the next point follows after this list).
  3. Preprocessing filters: the Chinese script should be added. In general we can use Unicode ranges to do so, but they are somewhat complicated: https://stackoverflow.com/questions/43418812/check-whether-a-string-contains-japanese-chinese-characters In the past I have used something like u'[\u4e00-\u9fff]', but this may be improved.
  4. Segmentation: Chinese text is typically input unsegmented; however, some of the datasets online contain segmentation. We should use a de-segmentation script like this one (this script also tries to fix some Chinese datasets finishing in a comma as opposed to a full stop, but that part could be split out into its own script):
#!/usr/bin/env python
"""De-segment Chinese text read from stdin and normalise sentence-final punctuation."""

import re
import sys


# Matches a whitespace character only when neither neighbour is a Latin letter,
# i.e. spaces between Chinese characters.
re_space = re.compile(r"(?<![a-zA-Z])\s(?![a-zA-Z])", flags=re.UNICODE)
# An ASCII full stop at the end of the line.
re_final_period = re.compile(r"\.$")


for line in sys.stdin:
    line = line.strip()  # drop the trailing newline and surrounding whitespace
    if not line:
        print(line)
        continue
    line = line.replace(" ", "")  # drop all spaces (this also loses spaces inside English phrases, see below)
    # Some datasets end sentences with a comma instead of a full stop.
    if line[-1] == u"\uFF0C":  # fullwidth comma -> ideographic full stop
        line = line[:-1] + u"\u3002"
    if line[-1] == ",":  # ASCII comma -> ASCII full stop (converted to 。 below)
        line = line[:-1] + "."
    line = re_space.sub("", line)  # remaining whitespace (e.g. fullwidth spaces) not adjacent to Latin letters
    line = line.replace(",", u"\uFF0C")  # remaining ASCII commas -> fullwidth commas
    line = re_final_period.sub(u"\u3002", line)  # trailing ASCII full stop -> ideographic full stop
    print(line)

This script essentially tries to identify Chinese characters and remove the spaces between them. It can probably be improved: currently the spaces between English words are lost as well, whereas we should write something more sophisticated that detects a continuous substring of English words and leaves it alone.
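A minimal sketch of points 2 and 3, assuming the hanziconv package linked above; the \u4e00-\u9fff range is only the basic CJK block and may need extending per the Stack Overflow link:

import re

from hanziconv import HanziConv

# Rough check for CJK Unified Ideographs (basic block only).
re_han = re.compile(u"[\u4e00-\u9fff]")

def has_chinese(text):
    """Return True if the line contains at least one CJK character."""
    return re_han.search(text) is not None

def to_simplified(text):
    """Convert traditional Chinese to simplified, e.g. for the TED talks data."""
    return HanziConv.toSimplified(text)

print(to_simplified("漢字"))   # -> 汉字
print(has_chinese("hello"))    # -> False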

  5. Length filtering: as Chinese sentences normally come as one continuous string of characters, traditional length filtering doesn't work. Furthermore, as one word can be made of 1-4 Chinese characters, we can't apply a hard-and-fast conversion rule. What people normally do is use a Chinese tokenizer (like jieba https://github.com/fxsjy/jieba#jieba-1 ) to split the Chinese text into words. We can then safely apply the filtering here:
    ratio_len = src_len / float(trg_len)

    Most papers recommend discarding lines where the ratio of English to Chinese (or Chinese to English) words is more than 1.3.

Afterwards the text should be de-segmented again and prepared for training. A minimal sketch of this filtering step follows below.
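A sketch of the length-ratio filter, assuming tab-separated parallel lines (English, then Chinese) on stdin; the input layout and the 1.3 threshold are illustrative, not fixed by this issue:

import sys

import jieba

MAX_RATIO = 1.3  # threshold suggested above; tune as needed

# Keep only sentence pairs whose word-count ratio is within the threshold.
for pair in sys.stdin:
    fields = pair.rstrip("\n").split("\t")
    if len(fields) != 2:
        continue
    src, trg = fields
    src_len = len(src.split())      # English: whitespace tokens
    trg_len = len(jieba.lcut(trg))  # Chinese: jieba word segmentation
    if src_len == 0 or trg_len == 0:
        continue
    ratio_len = src_len / float(trg_len)
    if 1.0 / MAX_RATIO <= ratio_len <= MAX_RATIO:
        print(pair, end="")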

  6. Corpus-specific fixes: the UN corpus, for example, doesn't contain full stops, and we use something like this to fix it:
import sys

# Replace a trailing comma with a full stop (the UN corpus lacks sentence-final full stops).
for line in sys.stdin:
    line = line.strip()  # drop the trailing newline and surrounding whitespace
    if line and line[-1] == ",":
        line = line[:-1] + "."
    print(line)

(This fix is already integrated into the larger script pasted above.)

All of these steps except 2) apply to Japanese as well; a Japanese tokenizer should be used in place of jieba for Japanese.
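For Japanese, a minimal sketch with fugashi (a MeCab wrapper; the choice of tokenizer and dictionary here is just one option, not something settled in this issue):

import fugashi  # pip install 'fugashi[unidic-lite]'

tagger = fugashi.Tagger()

def tokenize_ja(text):
    """Split Japanese text into surface-form tokens."""
    return [word.surface for word in tagger(text)]

print(tokenize_ja("日本語はスペースで区切られていません。"))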

@eu9ene added the language-coverage (Issues related to covering specific languages) and epic labels on Jan 4, 2022
@ZJaume

ZJaume commented Feb 11, 2022

Hi!

I've been working on a student model for traditional Chinese (not high quality, only for alignment purposes), and maybe some of my experience can be useful to you.

  • Most of the corpora downloaded from OPUS are in simplified, so supporting traditional needs additional attention. Here are some counts using hanzidentifier:

OPUS (neulab_tedtalksv1_train news_commentary_v14 OPUS_UN_v20090831 OPUS_UNPC_v1_0 OPUS_MultiUN_v1 OPUS_QED_v2_0a)

14711511 SIMP
   2267 MIXED
  88777 BOTH
   9403 TRAD

WikiMatrix

1141562 SIMP
1047046 TRAD
 221854 MIXED
  28071 BOTH
      1 UNK

CCAligned

9686412 SIMP
 109136 MIXED
  77473 BOTH
   5627 TRAD
  • For traditional-simplified conversion I found OpenCC, which seems to have better support and more active development than hanziconv. Punctuation also needs to be converted separately (a short identification/conversion sketch follows at the end of this comment).

  • I translated from traditional Chinese to English some text from a domain that was not present in the training corpora. It had a considerable number of characters unknown to the SentencePiece vocabulary. These characters were logographs but also punctuation (I realised that the conversion to traditional didn't convert the ASCII punctuation to Chinese punctuation, and it was unknown to the vocab). The result was a lot of random behaviour that wasn't detected with the WMT test sets. This is an example from the output:

Some of the monkeys can be fixed cursed monkey monkey monkey monkey monkey monkey monkey monkey monkey monkey monkey monkey monkey monkey monkey monkey monkey monkey monkey monkey monkey monkey monkey
Only the emperor and his sin's puppets need to do so.
First of all, we focus on the system rather than any kind of skill; only how to interact with the strength of the puppets of the puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet puppet
Some of the monkeys can be fixed to "the monkeys to accept" monkeys; the monkeys are too small, and there are shadows.

Some of the unknown characters that were part of a noun were translated as 'monkey' or 'puppet', and the absence of an ASCII period at the end caused repetition.

This model (zh_Hant->English) scored 20.3 BLEU on WMT19 converted to traditional with OpenCC.
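A minimal sketch of how the identification and conversion could be wired together, assuming the hanzidentifier and OpenCC Python packages mentioned above (the counting loop is just illustrative):

import sys

import hanzidentifier
import opencc

# Map hanzidentifier constants to readable labels.
LABELS = {
    hanzidentifier.SIMPLIFIED: "SIMP",
    hanzidentifier.TRADITIONAL: "TRAD",
    hanzidentifier.BOTH: "BOTH",
    hanzidentifier.MIXED: "MIXED",
    hanzidentifier.UNKNOWN: "UNK",
}

# Config name may be "t2s" or "t2s.json" depending on the OpenCC package flavour.
converter = opencc.OpenCC("t2s")  # traditional -> simplified

counts = {}
for line in sys.stdin:
    line = line.rstrip("\n")
    label = LABELS[hanzidentifier.identify(line)]
    counts[label] = counts.get(label, 0) + 1
    print(converter.convert(line))  # emit the simplified version

print(counts, file=sys.stderr)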

@ZJaume

ZJaume commented Feb 11, 2022

I solved the character coverage issue by training with all the traditional Chinese converted to pinyin, using this script:

from unicodedata import category as cat
from unidecode import unidecode as uni
from pypinyin import pinyin
import sys

# tell if a str contains punctuation
def is_punc(string):
    return any([cat(i).startswith('P') for i in string])

for line in sys.stdin:
    pyin = pinyin(line.rstrip('\n'))
    # Flatten the list and unidecode strings with punctuation
    pyin = [uni(i[0]) if is_punc(i[0]) else i[0] for i in pyin]
    print(' '.join(pyin))

The model lost 1 point of BLEU and the student a couple more with this approach, but the monkeys disappeared.

@eu9ene removed the epic label on Mar 5, 2024
@gregtatum changed the title from "Support Chinese megaissue" to "Support Chinese discussion" on Apr 10, 2024