Achieving Human Parity on Automatic Chinese to English News Translation #98

Open
kweonwooj opened this issue Mar 26, 2018 · 0 comments

Abstract

  • reports that the quality of Microsoft's Chinese-to-English machine translation on news sentences is at human parity
    • uses WMT2017 news translation data
    • defines how to accurately measure human parity in translation
    • describes the workflow and various experiments

Details

Defining Human Parity

  • official definition: if there is no statistically significant difference between the human quality scores for a test set of candidate translations from a machine translation system and the scores for the corresponding human translations, then the machine has achieved human parity
  • Evaluation method
    • use direct assessment described in WMT17
    • use source-based evaluation methodology described in IWSLT17
    • annotators are shown the source text and a candidate translation and asked "How accurately does the above candidate text convey the semantics of the source text?", answering with a slider ranging from 0 to 100 (100 being perfect)
    • to identify unreliable crowd workers, direct assessment randomly mixes in artificially degraded translation output
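
To make the parity definition concrete, here is a minimal sketch of a significance check on two sets of 0-100 assessment scores. The notes do not say which statistical test the paper uses, so the permutation test, the function names and the `alpha=0.05` threshold are all assumptions for illustration.

```python
import random
from statistics import mean


def permutation_test(machine_scores, human_scores, n_resamples=10000, seed=0):
    """Two-sided permutation test on the difference of mean scores.

    Assumption: a stand-in for whatever test the paper actually uses.
    """
    rng = random.Random(seed)
    observed = abs(mean(machine_scores) - mean(human_scores))
    pooled = list(machine_scores) + list(human_scores)
    n_machine = len(machine_scores)
    extreme = 0
    for _ in range(n_resamples):
        # Randomly reassign the pooled scores to the two groups.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_machine]) - mean(pooled[n_machine:]))
        if diff >= observed:
            extreme += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_resamples + 1)


def at_human_parity(machine_scores, human_scores, alpha=0.05):
    """Parity in the sense above: no significant difference at level alpha."""
    return permutation_test(machine_scores, human_scores) >= alpha
```

Note that failing to find a significant difference depends on sample size, which is one reason the evaluation campaigns further down matter.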

Neural Machine Translation

  • LSTM, ConvS2S and Transformer are all SoTA models; the authors choose Transformer as the baseline

Main Contributions

  • Main Techniques used to achieve human parity
    • Careful data selection and filtering
    • Dual Learning to utilize the duality of the translation problem
    • Iterative joint training algorithm described in Zhang et al. 2018 to enhance the effect of monolingual source using Back Translation
    • Deliberation Network to refine translation based on two-pass decoding
    • New training objective over KL divergence to encourage agreement between left-to-right and right-to-left translation
    • System Combination and Re-ranking

Data Selection and Filtering

  • Learn a bilingual sentence vector representation mapped into the same space to filter the noisy data and select relevant data
    • use method in Zoph et al. 2016 on subset of data known to be of good quality and relevant domain
    • use RNN enc-dec similar to GNMT as base model for representation learning
    • use cosine similarity of sentence representation of source S and target T
    • remove sentences with similarity below a specified threshold
  • Rule-based filtering
    • both source and target sentences must contain at least 3 and at most 70 words
    • pairs are kept only if src_len < 1.3 * tgt_len and tgt_len < 1.3 * src_len; all other pairs are removed
    • sentences with illegal characters (URLs, characters of other languages) are removed
    • Chinese sentences without any Chinese characters are removed
    • duplicated sentence pairs are removed
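
The filtering rules above can be sketched as follows. This is an illustration under assumptions, not the authors' code: the function names, the URL and CJK regular expressions and the `threshold=0.5` default are hypothetical, the sentence vectors are assumed to come from some separately trained encoder, and the "characters of other languages" check is left out.

```python
import math
import re

CJK = re.compile(r"[\u4e00-\u9fff]")   # basic CJK Unified Ideographs block
URL = re.compile(r"https?://|www\.")   # rough URL detector (assumption)


def cosine_similarity(u, v):
    """Cosine similarity between two sentence vectors given as sequences."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_enough(src_vec, tgt_vec, threshold=0.5):
    """Keep a pair only if its bilingual vectors are close; threshold is hypothetical."""
    return cosine_similarity(src_vec, tgt_vec) >= threshold


def passes_rules(zh_tokens, en_tokens, min_len=3, max_len=70, max_ratio=1.3):
    """Apply the length, ratio, URL and Chinese-character rules to one pair."""
    n_zh, n_en = len(zh_tokens), len(en_tokens)
    if not (min_len <= n_zh <= max_len and min_len <= n_en <= max_len):
        return False
    if n_zh >= max_ratio * n_en or n_en >= max_ratio * n_zh:
        return False
    zh_text, en_text = " ".join(zh_tokens), " ".join(en_tokens)
    if URL.search(zh_text) or URL.search(en_text):
        return False
    if not CJK.search(zh_text):
        return False
    return True


def filter_corpus(pairs):
    """Drop duplicate pairs and pairs that fail the rules; keeps input order."""
    seen, kept = set(), []
    for zh_tokens, en_tokens in pairs:
        key = (tuple(zh_tokens), tuple(en_tokens))
        if key in seen or not passes_rules(zh_tokens, en_tokens):
            continue
        seen.add(key)
        kept.append((zh_tokens, en_tokens))
    return kept
```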

Dual Learning

Iterative Joint Training

[figure from the paper: iterative joint training]

Deliberation Network

[figure from the paper: deliberation network]

L2R, R2L Agreement Regularization

  • signals from the R2L model can be leveraged to alleviate the exposure bias problem of the L2R model, and vice versa
    [figure from the paper: L2R/R2L agreement regularization]
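
As a rough illustration of a KL-based agreement term, the sketch below penalizes disagreement between the probabilities that an L2R and an R2L model assign to the same small set of candidate translations. The notes do not give the paper's exact objective, so the symmetric form, the function names and the `eps` smoothing are assumptions.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same candidates."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def agreement_penalty(l2r_probs, r2l_probs):
    """Symmetric KL term: zero when both models agree, positive otherwise."""
    return kl_divergence(l2r_probs, r2l_probs) + kl_divergence(r2l_probs, l2r_probs)
```

In training, a term like this would be added to the usual likelihood loss with some weight, so that each direction is pulled toward the other's predictions.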

System Combination and Re-ranking

  • combine n-best hypotheses from all systems and train a re-ranker using k-best MIRA (margin-based classification algorithm)
  • Features used for re-ranking are
    • original system score, 5-gram LM score, R2L score, Target2Source system re-score, cross-lingual sentence similarity between source and hypothesis
  • the most useful features turned out to be the original system score, LM score, R2L score, R2L sentence vector similarity and Target2Source sentence similarity
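
The re-ranking step can be pictured as a linear model over per-hypothesis features. The sketch below only shows the scoring side: the feature names and weights are hypothetical, and learning the weights with k-best MIRA is not implemented here.

```python
def rerank(hypotheses, weights):
    """Return the hypothesis with the highest weighted feature sum.

    hypotheses: list of dicts with a "text" and a "features" mapping.
    weights: feature name -> weight (in the paper these are learned;
             here they are simply passed in).
    """
    def score(hyp):
        return sum(weights[name] * hyp["features"][name] for name in weights)

    return max(hypotheses, key=score)
```

For example, with features `system`, `lm` and `r2l` holding log-probabilities from each model, a non-zero weight on `lm` lets a fluent hypothesis from one system overtake a slightly higher-scoring but less fluent one from another.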

NMT Pipeline

  • Train ZhEn and EnZh Transformer models using dual unsupervised learning (DUL) and dual supervised learning (DSL) on the bilingual corpus (multiple models can be trained for ensembling)
  • Generate a back-translation corpus from En and Zh monolingual sentences using the pre-trained models from the previous step
  • Train a Transformer model or Deliberation Network on the inflated (bilingual + synthetic) corpus; the pre-trained model's weights initialize the encoder and first-pass decoder of the Deliberation Network
    [figure from the paper: NMT pipeline]
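
A minimal sketch of the corpus inflation step for the Zh-to-En direction, assuming the pre-trained models are available as plain callables. `en_to_zh` and `zh_to_en` are hypothetical stand-ins for those models, and how the paper mixes or weights synthetic data is not covered by these notes.

```python
def build_training_corpus(bilingual_pairs, en_mono, zh_mono, en_to_zh, zh_to_en):
    """Return (zh, en) pairs: real bilingual data plus synthetic pairs.

    en_to_zh / zh_to_en: callables wrapping the pre-trained models (assumed).
    """
    # Back-translation: real English target, machine-generated Chinese source.
    from_en = [(en_to_zh(en), en) for en in en_mono]
    # Source-side monolingual data: real Chinese source, machine-generated target.
    from_zh = [(zh, zh_to_en(zh)) for zh in zh_mono]
    return list(bilingual_pairs) + from_en + from_zh
```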

Experiments - Benchmark on WMT17

  • Data
    • WMT17 EnZh, 18M bilingual pairs; newsdev2017 as dev set, newstest2017 as test set
    • use an LM trained on the 18M bilingual pairs to filter monolingual sentences from news.crawl and common.crawl
  • Vocab
    • Byte Pair Encoding (BPE) of Zh 44k, En 33k
  • Model
    • Transformer Big with Tensor2Tensor v1.3.0 open-source
    • 8x M40 GPUs
    • 200k training steps with Adam, learning_rate 0.3, decayed with the noam schedule
    • 5120 words per batch, checkpoints created every 60 min
    • results are reported on the averaged parameters of the last 20 checkpoints
    • beam=8, length_penalty=1.0
    • scores are reported using sacreBLEU v1.2.3
  • BLEU Score
    • Back-translation (BT) + Dual Learning + Deliberation Network combination performs best
    • Agreement Regularization adds little further improvement
      [figure from the paper: BLEU scores on WMT17]
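
Two of the training details above, the noam schedule and checkpoint averaging, are easy to sketch. The schedule follows the general form from the original Transformer paper; `warmup_steps=4000` is an assumption (the notes do not give it), and `scale` stands in for however the toolkit applies the configured learning_rate of 0.3. Checkpoints are modelled as plain dicts of float lists rather than real Tensor2Tensor files.

```python
def noam_learning_rate(step, d_model=1024, warmup_steps=4000, scale=1.0):
    """Linear warmup followed by inverse-square-root decay.

    d_model=1024 matches Transformer Big; warmup_steps and scale are assumptions.
    """
    step = max(step, 1)
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


def average_checkpoints(checkpoints):
    """Element-wise mean of parameters across checkpoints.

    checkpoints: list of dicts mapping parameter name -> list of floats.
    """
    n = len(checkpoints)
    return {
        name: [sum(values) / n for values in zip(*(c[name] for c in checkpoints))]
        for name in checkpoints[0]
    }
```

Averaging the last 20 checkpoints would then be `average_checkpoints(all_checkpoints[-20:])`, given a list ordered by training time.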

Experiment on Larger Corpus

  • Data
    • WMT17 18M + 35M/50M subset selected from 100M UN corpus
  • Vocab : same
  • Model
    • Transformer Big with 8,192 hidden_size in conv-1 block (bigger than original Transformer Big)
    • 300K training steps with Adam
    • minibatch of 3,500 with 8 GPUs
    • same beam, length_penalty and averaging param
  • BLEU Score
    • Base8k (larger model) performs better
    • adding the corpus selected via SentVect gives the largest BLEU improvement
      [figure from the paper: BLEU scores on the larger corpus]

Human Evaluation Results

  • Ensembles (Combo-4, 5, 6) reach human parity (scores equivalent to Reference-HT)
    • Reference-HT: human translations produced without using online translation engines
    • Reference-PE: human post-edits of Google Translate output
    • Reference-WMT: the original newstest2017 references released after WMT17
    • Online-A-1710: Microsoft Translator output collected in Oct 2017
    • Online-B-1710: Google Translate output collected in Oct 2017
      [two figures from the paper: human evaluation results]

Evaluation Campaigns

  • Motivated to resolve issues with human evaluation processes
    • Annotator variability: what if the same annotators give different results on the same data? Addressed by running three campaigns on the same evaluation data, which showed near-complete overlap
    • Data variability: addressed by running evaluations on completely different subsets of the test data, although the test data itself may already be biased

Human Analysis

  • preliminary human error analysis on best system
    [figure from the paper: human error analysis]

Personal Thoughts

  • Complete NMT workflow, from data selection up to human evaluation and error analysis
  • impressed by the extensive experiments, but the methods are biased toward MS ideas
  • a human evaluation comparing the baseline model with the improved model would have been interesting, to see which error types were reduced

Link : https://www.microsoft.com/en-us/research/uploads/prod/2018/03/final-achieving-human.pdf
Authors : Hassan et al. 2018
