Abstract
reports that the quality of Microsoft's Chinese-to-English machine translation on news sentences is at human parity
uses WMT2017 news translation data
defines how to accurately measure human parity in translation
describes the workflow and various experiments
Details
Defining Human Parity
official definition: If there is no statistically significant difference between human quality scores for a test set of candidate translations from a machine translation system and the scores for the corresponding human translations, then the machine has achieved human parity
use direct assessment (as in WMT17) with the source-based evaluation methodology described in IWSLT17
annotators are shown the source text and a candidate translation and asked the question "How accurately does the above candidate text convey the semantics of the source text?", answering with a slider ranging from 0 to 100 (100 being perfect)
to identify unreliable crowd workers, the direct assessment randomly includes artificially degraded translation output
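To make the parity definition above concrete, here is a minimal sketch of the kind of significance test it implies, assuming paired per-sentence direct-assessment scores; the paper's exact statistical test and threshold may differ.

```python
# Minimal sketch of testing for "human parity" on direct-assessment scores.
# Assumes paired per-sentence scores; the paper's exact significance test may differ.
from scipy import stats

mt_scores = [78.0, 92.5, 66.0, 81.0, 74.5]      # DA scores for machine translations (toy data)
human_scores = [80.0, 90.0, 70.0, 79.5, 76.0]   # DA scores for the human reference (toy data)

t_stat, p_value = stats.ttest_rel(mt_scores, human_scores)

# Under this definition, parity is claimed when the difference is NOT statistically significant.
print(f"p = {p_value:.3f} -> {'human parity' if p_value >= 0.05 else 'no parity'}")
```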
Neural Machine Translation
LSTM, ConvS2S, and Transformer are all SoTA architectures; the Transformer is chosen as the baseline
Main Contributions
Main Techniques used to achieve human parity
Careful data selection and filtering
Dual Learning to utilize the duality of the translation problem
Iterative joint training algorithm described in Zhang et al. 2018 to enhance the effect of monolingual source data via Back Translation
Dual Supervised Learning trains the primal and dual models simultaneously, with a regularization term that encourages the duality of their probability distributions
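As a reminder of what the dual supervised learning regularizer looks like (following Xia et al. 2017, which this work builds on), both translation directions are trained with maximum likelihood plus a penalty for violating probabilistic duality; a sketch of that term:

```latex
% Probabilistic duality regularizer used in dual supervised learning
% (sketch following Xia et al. 2017; \hat{P}(x), \hat{P}(y) are empirical language-model estimates)
\mathcal{R}(\theta_{xy}, \theta_{yx}) =
  \Big( \log \hat{P}(x) + \log P(y \mid x; \theta_{xy})
      - \log \hat{P}(y) - \log P(x \mid y; \theta_{yx}) \Big)^{2}
```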
Iterative Joint Training
Deliberation Network
L2R, R2L Agreement Regularization
signals from the R2L model can be leveraged to alleviate the exposure bias problem of the L2R model, and vice versa
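Schematically, the agreement regularization trains the L2R model with an extra term that penalizes disagreement with the R2L model (and symmetrically for R2L); a hedged sketch, where Δ stands for whatever disagreement measure the paper uses:

```latex
% Sketch of L2R/R2L agreement regularization; \Delta denotes a disagreement measure
% between the two models' distributions (the paper's exact choice may differ)
\mathcal{L}(\theta_{L2R}) = \mathcal{L}_{MLE}(\theta_{L2R})
  + \lambda \, \Delta\big( P_{L2R}(y \mid x),\; P_{R2L}(y \mid x) \big)
```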
System Combination and Re-ranking
combine n-best hypotheses from all systems and train a re-ranker using k-best MIRA (Margin Infused Relaxed Algorithm, a margin-based learning algorithm)
Features used for re-ranking are
original system score, 5-gram LM score, R2L score, Target2Source system re-score, cross-lingual sentence similarity between source and hypothesis
the best features turned out to be: original system score, LM score, R2L score, R2L sentence vector similarity, and Target2Source sentence similarity
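To illustrate how such a re-ranker combines features at decoding time, here is a minimal sketch that scores each hypothesis as a weighted sum of its feature values; the feature names and weights below are hypothetical, and in the paper the weights are learned with k-best MIRA.

```python
# Minimal sketch of feature-based re-ranking over combined n-best hypotheses.
# Feature names and weight values are hypothetical; the paper learns weights with k-best MIRA.

WEIGHTS = {
    "system_score": 1.0,   # original NMT system score
    "lm_score": 0.3,       # 5-gram language model score
    "r2l_score": 0.5,      # right-to-left model re-score
    "t2s_score": 0.4,      # Target2Source model re-score
    "sent_sim": 0.2,       # cross-lingual sentence-vector similarity
}

def rerank(hypotheses):
    """Pick the hypothesis with the highest weighted feature sum.

    `hypotheses` is a list of (text, features) pairs, where `features`
    maps feature names to already-computed scores.
    """
    def total(item):
        _, feats = item
        return sum(WEIGHTS[name] * feats.get(name, 0.0) for name in WEIGHTS)
    return max(hypotheses, key=total)[0]

best = rerank([
    ("translation A", {"system_score": -2.1, "lm_score": -30.5, "r2l_score": -2.4, "t2s_score": -2.0, "sent_sim": 0.81}),
    ("translation B", {"system_score": -2.3, "lm_score": -28.9, "r2l_score": -2.2, "t2s_score": -1.9, "sent_sim": 0.84}),
])
```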
NMT Pipeline
Train Zh→En and En→Zh Transformer models using DUL and DSL on the bilingual corpus (multiple models can be trained to leverage ensembling)
Generate a back-translation corpus from En and Zh monolingual sentences using the pre-trained models from the previous step (sketched after this list)
Train a Transformer model or Deliberation Network on the inflated bilingual corpus, using the pre-trained model's weights to initialize the encoder and first-pass decoder of the Deliberation Network
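A minimal sketch of the back-translation step above; `translate_en_to_zh` is a hypothetical stand-in for the pre-trained En→Zh model from the first step.

```python
# Minimal sketch of generating synthetic (Zh, En) pairs by back-translating English monolingual data.
# `translate_en_to_zh` is a placeholder for the pre-trained EnZh Transformer from step 1.

def translate_en_to_zh(sentence: str) -> str:
    """Placeholder: in practice this calls the pre-trained En->Zh model."""
    return f"<zh translation of: {sentence}>"

def build_back_translation_corpus(en_monolingual):
    """Pair each English sentence with its synthetic Chinese 'source' to inflate the bilingual corpus."""
    return [(translate_en_to_zh(en), en) for en in en_monolingual]

synthetic_pairs = build_back_translation_corpus(["The system reached human parity on news translation."])
```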
Experiments - Benchmark on WMT17
Data
WMT17 EnZh: 18M bilingual pairs; newsdev2017 as dev set, newstest2017 as test set
use an LM trained on the 18M bilingual pairs to filter monolingual sentences from news.crawl and common.crawl
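A minimal sketch of this LM-based filtering step, assuming a KenLM-style n-gram model trained on the English side of the bilingual data; the actual toolkit, file names, and threshold are not specified in the notes and are assumptions here.

```python
# Minimal sketch of filtering crawled monolingual sentences with a language model
# trained on the bilingual data. KenLM and the threshold are assumptions, not from the paper.
import kenlm

lm = kenlm.Model("bilingual_en.arpa")  # hypothetical LM built from the 18M-pair English side

def keep_sentence(sentence: str, threshold: float = -6.0) -> bool:
    """Keep a sentence if its length-normalized LM log-score is above a (hypothetical) threshold."""
    words = sentence.split()
    return lm.score(sentence) / max(len(words), 1) > threshold

filtered = [s for s in ["Some crawled sentence .", "asdf qwer zxcv"] if keep_sentence(s)]
```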
Vocab
Byte Pair Encoding (BPE) of Zh 44k, En 33k
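The notes don't say which BPE implementation was used; as one possible way to build vocabularies of roughly this size, here is a sketch using the sentencepiece library (the tool choice and file names are assumptions).

```python
# Sketch of learning BPE vocabularies of ~44k (Zh) and ~33k (En) subwords.
# sentencepiece is only one possible tool; the paper's actual implementation may differ.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train.zh", model_prefix="zh_bpe", vocab_size=44000, model_type="bpe")
spm.SentencePieceTrainer.train(
    input="train.en", model_prefix="en_bpe", vocab_size=33000, model_type="bpe")

sp = spm.SentencePieceProcessor(model_file="zh_bpe.model")
pieces = sp.encode("机器翻译达到人类水平", out_type=str)  # "machine translation reaches human level"
```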
Model
Transformer Big, using the open-source Tensor2Tensor v1.3.0
8x M40 GPUs
200k steps of Adam with learning_rate 0.3, decayed with the noam schedule (see the sketch after this list)
5,120 words per batch, checkpoints created every 60 min
results are reported on the averaged parameters of the last 20 checkpoints
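For reference, the noam schedule mentioned above combines a linear warmup with inverse square-root decay; a minimal sketch follows, where the 0.3 base rate comes from the notes and d_model / warmup_steps are typical Transformer Big values, not confirmed by the paper.

```python
# Sketch of the "noam" learning-rate schedule (Vaswani et al. 2017) as used in Tensor2Tensor.
# base_lr = 0.3 is from the notes; d_model and warmup_steps are typical values, not confirmed.

def noam_lr(step: int, base_lr: float = 0.3, d_model: int = 1024, warmup_steps: int = 16000) -> float:
    """Linear warmup followed by inverse-square-root decay, scaled by a base learning rate."""
    step = max(step, 1)
    return base_lr * (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)

# The peak is reached at `warmup_steps`, after which the rate decays as 1/sqrt(step).
print([noam_lr(s) for s in (1, 8000, 16000, 100000, 200000)])
```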
Back Translation (BT) + Dual Learning + Deliberation Network combination performs best
Experiment on Larger Corpus
use SentVect similarity filtering, described in the Data Selection section above, with threshold 0.2
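A minimal sketch of this sentence-vector filter: encode source and target with a bilingual sentence encoder and keep pairs whose cosine similarity clears the 0.2 threshold; the `encode` function below is a hypothetical placeholder for that encoder.

```python
# Sketch of filtering bilingual pairs by cross-lingual sentence-vector similarity (threshold 0.2).
# `encode` is a hypothetical stand-in for the bilingual sentence encoder behind SentVect.
import numpy as np

def encode(sentence: str) -> np.ndarray:
    """Placeholder: a learned sentence embedding shared across Zh and En."""
    rng = np.random.default_rng(abs(hash(sentence)) % (2 ** 32))
    return rng.standard_normal(512)

def keep_pair(zh: str, en: str, threshold: float = 0.2) -> bool:
    """Keep the pair only if the Zh and En sentence vectors are similar enough."""
    u, v = encode(zh), encode(en)
    cosine = float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    return cosine > threshold
```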
Vocab : same
Model
Transformer Big with 8,192 hidden units in the conv-1 (feed-forward) block (bigger than the original Transformer Big)
300K steps of Adam
minibatch of 3,500 on 8 GPUs
same beam size, length_penalty, and checkpoint-averaging parameters as above
BLEU Score
Base8k (larger model) performs better
the additional corpus selected via SentVect similarity gives the largest BLEU improvement
Human Evaluation Results
Ensembles (Combo-4, 5, 6) achieve human parity (scores equivalent to Reference-HT)
Reference-HT are human translations without using online translation engines
Reference-PE are human post-edited outputs based on Google Translate results
Reference-WMT are original newstest2017 reference released after WMT17
Online-A-1710 : Microsoft Translator output collected in Oct 2017
Online-B-1710 : Google Translate output collected in Oct 2017
Evaluation Campaigns
Motivated by the need to resolve issues with the human evaluation process
Annotator variability: what if the same annotator provides different results on the same data? -> addressed by running three campaigns on the same evaluation data, which showed near-complete overlap in results
Data variability: conduct evaluations on completely different subsets of the test data, although the test data itself may already be biased
Human Analysis
preliminary human error analysis on best system
Personal Thoughts
Complete NMT workflow, from data selection up to human evaluation and error analysis
Impressed by the extensive experiments, but the methods are biased toward Microsoft's own ideas
A human evaluation comparing the baseline model and the improved model would have been interesting, to see how the error types were reduced
Link : https://www.microsoft.com/en-us/research/uploads/prod/2018/03/final-achieving-human.pdf
Authors : Hassan et al. 2018