Abstract
reports that the quality of Microsoft's Chinese-to-English machine translation on news sentences is at human parity
uses WMT2017 news translation data
defines how to accurately measure human parity in translation
describes the workflow and various experiments
Details
Defining Human Parity
official definition: If there is no statistically significant difference between human quality scores for a test set of candidate translations from a machine translation system and the scores for the corresponding human translations, then the machine has achieved human parity
use direct assessment (as in WMT17) with the source-based evaluation methodology described in IWSLT17
annotators are shown the source text and a candidate translation and asked the question "How accurately does the above candidate text convey the semantics of the source text?", answering with a slider ranging from 0 to 100 (100 being perfect)
to identify unreliable crowd workers, the direct assessment randomly includes artificially degraded translation output
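To make the parity definition above concrete, here is a minimal sketch of the kind of significance test it implies, assuming paired per-sentence direct-assessment scores; the paper's exact statistical test and threshold may differ.

```python
# Minimal sketch of testing for "human parity" on direct-assessment scores.
# Assumes paired per-sentence scores; the paper's exact significance test may differ.
from scipy import stats

mt_scores = [78.0, 92.5, 66.0, 81.0, 74.5]      # DA scores for machine translations (toy data)
human_scores = [80.0, 90.0, 70.0, 79.5, 76.0]   # DA scores for the human reference (toy data)

t_stat, p_value = stats.ttest_rel(mt_scores, human_scores)

# Under this definition, parity is claimed when the difference is NOT statistically significant.
print(f"p = {p_value:.3f} -> {'human parity' if p_value >= 0.05 else 'no parity'}")
```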
Neural Machine Translation
LSTM, ConvS2S, and Transformer are all SoTA architectures; the Transformer is chosen as the baseline
Main Contributions
Main Techniques used to achieve human parity
Careful data selection and filtering
Dual Learning to utilize the duality of the translation problem
Iterative joint training algorithm described in Zhang et al. 2018 to enhance the effect of monolingual source data via Back Translation
Dual Supervised Learning trains the primal and dual models simultaneously, with a regularization term that encourages the duality of their probability distributions
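As a reminder of what the dual supervised learning regularizer looks like (following Xia et al. 2017, which this work builds on), both translation directions are trained with maximum likelihood plus a penalty for violating probabilistic duality; a sketch of that term:

```latex
% Probabilistic duality regularizer used in dual supervised learning
% (sketch following Xia et al. 2017; \hat{P}(x), \hat{P}(y) are empirical language-model estimates)
\mathcal{R}(\theta_{xy}, \theta_{yx}) =
  \Big( \log \hat{P}(x) + \log P(y \mid x; \theta_{xy})
      - \log \hat{P}(y) - \log P(x \mid y; \theta_{yx}) \Big)^{2}
```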
Iterative Joint Training
Deliberation Network
L2R, R2L Agreement Regularization
signals from the R2L model can be leveraged to alleviate the exposure bias problem of the L2R model, and vice versa
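Schematically, the agreement regularization trains the L2R model with an extra term that penalizes disagreement with the R2L model (and symmetrically for R2L); a hedged sketch, where Δ stands for whatever disagreement measure the paper uses:

```latex
% Sketch of L2R/R2L agreement regularization; \Delta denotes a disagreement measure
% between the two models' distributions (the paper's exact choice may differ)
\mathcal{L}(\theta_{L2R}) = \mathcal{L}_{MLE}(\theta_{L2R})
  + \lambda \, \Delta\big( P_{L2R}(y \mid x),\; P_{R2L}(y \mid x) \big)
```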
System Combination and Re-ranking
combine n-best hypotheses from all systems and train a re-ranker using k-best MIRA (Margin Infused Relaxed Algorithm, a margin-based learning algorithm)
Features used for re-ranking are
original system score, 5-gram LM score, R2L score, Target2Source system re-score, cross-lingual sentence similarity between source and hypothesis
the best features turned out to be: original system score, LM score, R2L score, R2L sentence vector similarity, and Target2Source sentence similarity
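To illustrate how such a re-ranker combines features at decoding time, here is a minimal sketch that scores each hypothesis as a weighted sum of its feature values; the feature names and weights below are hypothetical, and in the paper the weights are learned with k-best MIRA.

```python
# Minimal sketch of feature-based re-ranking over combined n-best hypotheses.
# Feature names and weight values are hypothetical; the paper learns weights with k-best MIRA.

WEIGHTS = {
    "system_score": 1.0,   # original NMT system score
    "lm_score": 0.3,       # 5-gram language model score
    "r2l_score": 0.5,      # right-to-left model re-score
    "t2s_score": 0.4,      # Target2Source model re-score
    "sent_sim": 0.2,       # cross-lingual sentence-vector similarity
}

def rerank(hypotheses):
    """Pick the hypothesis with the highest weighted feature sum.

    `hypotheses` is a list of (text, features) pairs, where `features`
    maps feature names to already-computed scores.
    """
    def total(item):
        _, feats = item
        return sum(WEIGHTS[name] * feats.get(name, 0.0) for name in WEIGHTS)
    return max(hypotheses, key=total)[0]

best = rerank([
    ("translation A", {"system_score": -2.1, "lm_score": -30.5, "r2l_score": -2.4, "t2s_score": -2.0, "sent_sim": 0.81}),
    ("translation B", {"system_score": -2.3, "lm_score": -28.9, "r2l_score": -2.2, "t2s_score": -1.9, "sent_sim": 0.84}),
])
```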
NMT Pipeline
Train Zh→En and En→Zh Transformer models using DUL and DSL on the bilingual corpus (multiple models can be trained to leverage ensembling)
Generate a back-translation corpus from En and Zh monolingual sentences using the pre-trained models from the previous step (sketched after this list)
Train a Transformer model or Deliberation Network on the inflated bilingual corpus, using the pre-trained model's weights to initialize the encoder and first-pass decoder of the Deliberation Network
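A minimal sketch of the back-translation step above; `translate_en_to_zh` is a hypothetical stand-in for the pre-trained En→Zh model from the first step.

```python
# Minimal sketch of generating synthetic (Zh, En) pairs by back-translating English monolingual data.
# `translate_en_to_zh` is a placeholder for the pre-trained EnZh Transformer from step 1.

def translate_en_to_zh(sentence: str) -> str:
    """Placeholder: in practice this calls the pre-trained En->Zh model."""
    return f"<zh translation of: {sentence}>"

def build_back_translation_corpus(en_monolingual):
    """Pair each English sentence with its synthetic Chinese 'source' to inflate the bilingual corpus."""
    return [(translate_en_to_zh(en), en) for en in en_monolingual]

synthetic_pairs = build_back_translation_corpus(["The system reached human parity on news translation."])
```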
Experiments - Benchmark on WMT17
Data
WMT17 EnZh: 18M bilingual pairs; newsdev2017 as dev set, newstest2017 as test set
use an LM trained on the 18M bilingual pairs to filter monolingual sentences from news.crawl and common.crawl
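A minimal sketch of this LM-based filtering step, assuming a KenLM-style n-gram model trained on the English side of the bilingual data; the actual toolkit, file names, and threshold are not specified in the notes and are assumptions here.

```python
# Minimal sketch of filtering crawled monolingual sentences with a language model
# trained on the bilingual data. KenLM and the threshold are assumptions, not from the paper.
import kenlm

lm = kenlm.Model("bilingual_en.arpa")  # hypothetical LM built from the 18M-pair English side

def keep_sentence(sentence: str, threshold: float = -6.0) -> bool:
    """Keep a sentence if its length-normalized LM log-score is above a (hypothetical) threshold."""
    words = sentence.split()
    return lm.score(sentence) / max(len(words), 1) > threshold

filtered = [s for s in ["Some crawled sentence .", "asdf qwer zxcv"] if keep_sentence(s)]
```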
Vocab
Byte Pair Encoding (BPE) of Zh 44k, En 33k
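The notes don't say which BPE implementation was used; as one possible way to build vocabularies of roughly this size, here is a sketch using the sentencepiece library (the tool choice and file names are assumptions).

```python
# Sketch of learning BPE vocabularies of ~44k (Zh) and ~33k (En) subwords.
# sentencepiece is only one possible tool; the paper's actual implementation may differ.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train.zh", model_prefix="zh_bpe", vocab_size=44000, model_type="bpe")
spm.SentencePieceTrainer.train(
    input="train.en", model_prefix="en_bpe", vocab_size=33000, model_type="bpe")

sp = spm.SentencePieceProcessor(model_file="zh_bpe.model")
pieces = sp.encode("机器翻译达到人类水平", out_type=str)  # "machine translation reaches human level"
```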
Model
Transformer Big, using the open-source Tensor2Tensor v1.3.0
8x M40 GPUs
200k steps of Adam with learning_rate 0.3, decayed with the noam schedule (see the sketch after this list)
5,120 words per batch, checkpoints created every 60 min
results are reported on the averaged parameters of the last 20 checkpoints
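For reference, the noam schedule mentioned above combines a linear warmup with inverse square-root decay; a minimal sketch follows, where the 0.3 base rate comes from the notes and d_model / warmup_steps are typical Transformer Big values, not confirmed by the paper.

```python
# Sketch of the "noam" learning-rate schedule (Vaswani et al. 2017) as used in Tensor2Tensor.
# base_lr = 0.3 is from the notes; d_model and warmup_steps are typical values, not confirmed.

def noam_lr(step: int, base_lr: float = 0.3, d_model: int = 1024, warmup_steps: int = 16000) -> float:
    """Linear warmup followed by inverse-square-root decay, scaled by a base learning rate."""
    step = max(step, 1)
    return base_lr * (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)

# The peak is reached at `warmup_steps`, after which the rate decays as 1/sqrt(step).
print([noam_lr(s) for s in (1, 8000, 16000, 100000, 200000)])
```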
Back Translation (BT) + Dual Learning + Deliberation Network combination performs best
Experiment on Larger Corpus
use SentVect similarity filtering, described in the Data Selection section above, with threshold 0.2
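A minimal sketch of this sentence-vector filter: encode source and target with a bilingual sentence encoder and keep pairs whose cosine similarity clears the 0.2 threshold; the `encode` function below is a hypothetical placeholder for that encoder.

```python
# Sketch of filtering bilingual pairs by cross-lingual sentence-vector similarity (threshold 0.2).
# `encode` is a hypothetical stand-in for the bilingual sentence encoder behind SentVect.
import numpy as np

def encode(sentence: str) -> np.ndarray:
    """Placeholder: a learned sentence embedding shared across Zh and En."""
    rng = np.random.default_rng(abs(hash(sentence)) % (2 ** 32))
    return rng.standard_normal(512)

def keep_pair(zh: str, en: str, threshold: float = 0.2) -> bool:
    """Keep the pair only if the Zh and En sentence vectors are similar enough."""
    u, v = encode(zh), encode(en)
    cosine = float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    return cosine > threshold
```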
Vocab : same
Model
Transformer Big with 8,192 hidden units in the conv-1 (feed-forward) block (bigger than the original Transformer Big)
300K steps of Adam
minibatch of 3,500 on 8 GPUs
same beam size, length_penalty, and checkpoint-averaging parameters as above
BLEU Score
Base8k (larger model) performs better
the additional corpus selected via SentVect similarity gives the largest BLEU improvement
Human Evaluation Results
Ensembles (Combo-4, 5, 6) achieve human parity (scores equivalent to Reference-HT)
Reference-HT are human translations without using online translation engines
Reference-PE are human post-edited outputs based on Google Translate results
Reference-WMT are original newstest2017 reference released after WMT17
Online-A-1710 : Microsoft Translator output collected in Oct 2017
Online-B-1710 : Google Translate output collected in Oct 2017
Evaluation Campaigns
Motivated by the need to resolve issues with the human evaluation process
Annotator variability: what if the same annotator provides different results on the same data? -> addressed by running three campaigns on the same evaluation data, which showed near-complete overlap in results
Data variability: conduct evaluations on completely different subsets of the test data, although the test data itself may already be biased
Human Analysis
preliminary human error analysis on best system
Personal Thoughts
Complete NMT workflow, from data selection up to human evaluation and error analysis
Impressed by the extensive experiments, but the methods are biased toward Microsoft's own ideas
A human evaluation comparing the baseline model and the improved model would have been interesting, to see how the error types were reduced
Link : https://www.microsoft.com/en-us/research/uploads/prod/2018/03/final-achieving-human.pdf
Authors : Hassan et al. 2018