## Quality Estimation for Machine/Human Translation (Unsupervised Approach)

**Author:** Jessica Silva

**Keywords:** Quality Estimation; Machine Translation; Word-level; Sentence-level

**Date:** 20/07/2020

## SUMMARY <a class="tocSkip">

### Context <a class="tocSkip">

The aim of this research is to understand and evaluate the quality estimation task for machine/human translation. The goal of this task is to assess the quality of a translation without access to reference translations.

### Questions <a class="tocSkip">

1) Which strategies can be used for measuring quality of translations? In which cases can they be applied? 

2) What kind of data, and how much of it, is necessary to train these approaches?

3) Gains with the new approach.

### Main Outcomes (ongoing) <a class="tocSkip">

1) Which strategies can be used for measuring quality of machine translations? In which cases can they be applied? 

-

2) What kind of data, and how much of it, is necessary to train these approaches?

-

3) Gains with the new approach.

-

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc">
	<ul class="toc-item">
		<li>
			<span><a href="#Problem-Statement" data-toc-modified-id="Problem-Statement-1">
				<span class="toc-item-num">1&nbsp;&nbsp;</span>Problem Statement</a>
			</span>
		</li>
		<li>
			<span><a href="#Experimental-Setup" data-toc-modified-id="Experimental-Setup-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Experimental Setup</a></span>
		</li>
		<li>
			<span><a href="#Multilingual-sentence-embeddings" data-toc-modified-id="Multilingual Sentence Embeddings-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Multilingual Sentence Embeddings</a></span>
			<ul class="toc-item">
				<li>
					<span><a href="#Results" data-toc-modified-id="Results-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Dataset</a></span>
				</li>
                <li>
					<span><a href="#TrainPredict" data-toc-modified-id="TrainPredict-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Train and Predict</a></span>
				</li>
                <li>
                    <span><a href="#Evaluation" data-toc-modified-id="Evaluation-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Evaluation</a></span>
                </li>
			</ul>
		</li>
		<li>
			<span><a href="#Benchmarks" data-toc-modified-id="Benchmarks-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Benchmarks</a></span>
		</li>
		<li>
			<span><a href="#Future-Work" data-toc-modified-id="Future-Work-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Future Work</a></span>
		</li>
	</ul>
</div>

## 1 - Problem Statement
The aim of this research is to understand and evaluate the quality estimation task for machine/human translation. The goal of this task is to assess the quality of a translation without access to reference translations.

## 2 - Experimental Setup 

Before running the Quality Estimation tutorial, a small setup is required. Please check the README.md before continuing.

## 3 - Multilingual Sentence Embeddings

The Multilingual Sentence Embeddings resource measures quality at the sentence level:

* **Sentence-level**:
The goal of the Sentence-level QE task using multilingual sentence embeddings is to predict the quality of the whole translated sentence, based on the semantic similarity score between two sentences in different languages.

### Architecture

**Sentence Transformer - Knowledge Distillation approach**

The architecture consists of two models, the **Teacher** and the **Student**. The Teacher model produces sentence embeddings for the source language. Given translated sentences, the Student model learns to mimic the Teacher and generate matching sentence embeddings for the target language. This training aligns the source and target vector spaces, which makes it possible to measure the cosine similarity between a source sentence and its translation.

These models rely on transfer learning and need to be initialized with pretrained language models such as **BERT**, **GPT-2 or GPT-3**, **XLM**, **XLNet**, **RoBERTa** and so on. We follow the [paper](https://arxiv.org/abs/2004.09813) and keep **SBERT (English model)** to initialize the Teacher model and **XLM-R (multilingual model with 100 languages)** to initialize the Student model.

Some features:

* creates multilingual versions of previously monolingual models

* an easy and efficient method to extend existing sentence embedding models to new languages

* the training is based on the idea that a translated sentence should be mapped to the same location in the vector space as the original sentence

<img src='images/knowledge-distillation.png' width='700'>
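
The training idea described above can be sketched as a loss function. This is a minimal toy version, assuming the objective from the knowledge distillation paper: for each sentence pair, the Student's embeddings of both the source and the translated sentence are pulled towards the Teacher's embedding of the source, using mean squared error. The function names and the plain-list vectors are illustrative assumptions, not the library's implementation.

```python
def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def distillation_loss(teacher_src, student_src, student_tgt):
    """Toy distillation objective over a batch of sentence pairs.

    teacher_src: Teacher embeddings of the source sentences
    student_src: Student embeddings of the same source sentences
    student_tgt: Student embeddings of the translated sentences
    """
    total = 0.0
    for t, s_src, s_tgt in zip(teacher_src, student_src, student_tgt):
        # Both Student embeddings should land where the Teacher put the source.
        total += mse(t, s_src) + mse(t, s_tgt)
    return total / len(teacher_src)


# Hypothetical 2-dimensional embeddings for a single sentence pair.
loss = distillation_loss(
    teacher_src=[[1.0, 0.0]],
    student_src=[[1.0, 0.0]],
    student_tgt=[[0.0, 0.0]],
)
print("Distillation loss: %.4f" % loss)
```

A loss of zero would mean the Student maps both the source and its translation exactly onto the Teacher's embedding, which is the alignment that the cosine-based score relies on.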

## 3.1 - Dataset

The Sentence Transformer is trained to learn multilingual sentence embeddings and can be fine-tuned to learn a specific task.

### Sentence Transformer data without fine-tuning

* **Format**: ( _src, tgt_ )

_src: Sentences in the source language_

_tgt: Translated sentences in the target language_

* **tags**: _no tags (raw data)_

### Sentence Transformer data with fine-tuning

Fine-tuning can be done on the Semantic Textual Similarity task, using a parallel dataset annotated with semantic similarity scores:

* **Format**: ( _src, tgt_ )

_src: Sentences in the source language_

_tgt: Translated sentences in the target language_

* **tags**: _semantic similarity score_
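
To make the two record shapes above concrete, the sketch below parses one line of data into a `(src, tgt)` pair with an optional similarity score. The tab-separated layout, the `ParallelExample` type and the `parse_line` helper are assumptions made for illustration; the actual file layout of the training data is not specified here.

```python
from typing import NamedTuple, Optional


class ParallelExample(NamedTuple):
    """One training record (hypothetical container, for illustration only)."""
    src: str                        # sentence in the source language
    tgt: str                        # translated sentence in the target language
    score: Optional[float] = None   # semantic similarity tag, fine-tuning data only


def parse_line(line: str) -> ParallelExample:
    """Parse 'src<TAB>tgt' (raw data) or 'src<TAB>tgt<TAB>score' (fine-tuning data)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 2:
        return ParallelExample(fields[0], fields[1])
    if len(fields) == 3:
        return ParallelExample(fields[0], fields[1], float(fields[2]))
    raise ValueError("expected 2 or 3 tab-separated fields, got %d" % len(fields))


raw = parse_line("Eric has 22\tEric ha 22\n")
tagged = parse_line("Eric has 22\tEric ha 22\t0.95\n")
print(raw)
print(tagged)
```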

## 3.2 - Train and Predict

In [1]:
import os
import logging

import torch
import pandas as pd
from ipywidgets import interact, fixed, Textarea
from sentence_transformers import SentenceTransformer, util, evaluation, LoggingHandler

from src import utils



Here we load a public pretrained student model that was extended to 13 languages.

**distiluse-base-multilingual-cased**: Supported languages: Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish. Model is based on DistilBERT-multi-lingual.

In [2]:
model = SentenceTransformer('distiluse-base-multilingual-cased')

Calculating the cosine similarity between sentences:

In [3]:
source = ['Eric has 22']
source_embedding = model.encode(source, convert_to_tensor=True)

In [4]:
target = ['Eric ha 22']
target_embedding = model.encode(target, convert_to_tensor=True)

In [5]:
score = util.pytorch_cos_sim(source_embedding, target_embedding)[0]
print("(Score: %.4f)" % (score))

(Score: 0.9576)
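
The score above is a cosine similarity: the dot product of the two embeddings divided by the product of their norms, so it ranges from -1 to 1 and higher means more similar. A minimal standalone sketch of that formula follows; the function name and the toy vectors are illustrative and unrelated to the model's real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors: the first pair points in the same direction, the second is orthogonal.
print("(Score: %.4f)" % cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print("(Score: %.4f)" % cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Cosine distance is usually defined as one minus this value, so a high similarity corresponds to a short distance.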


Interactive demo:

In [6]:
SOURCE = Textarea(value=source[0], layout={'width': '90%'})
TARGET = Textarea(value=target[0], layout={'width': '90%'})
_interact = interact(utils.SentenceTransformerViz, model=fixed(model), source=SOURCE, target=TARGET)


### Adding a new language

Here we trained a new student model covering three closely related languages: Spanish, European Portuguese and Brazilian Portuguese. The difference from the previous model is the addition of the EN-Brazilian-Portuguese parallel dataset (413,689 sentences).

For initialization, we again kept SBERT (English model) as the teacher model and XLM-R (multilingual model with 100 languages) as the student model.

Download the trained model folder [here](https://definedcrowd-my.sharepoint.com/:u:/p/jessica_silva/ETITnnE5olxPpzh-5_iAwRgBA2T4H2pFLqVi_ijF2r5ESA?e=r1fslp) and extract it into `models/`.

In [12]:
new_model = SentenceTransformer('../models/trained_model-en-es-pt-ptbr')

In [13]:
SOURCE = Textarea(layout={'width': '90%'})
TARGET = Textarea(layout={'width': '90%'})
_interact = interact(utils.SentenceTransformerViz, model=fixed(new_model), source=SOURCE, target=TARGET)

Score: <span style='color:green'>1.0000 </span>

## 3.3 - Evaluation

### Translation Retrieval

One way to assess the quality of our model is through a Translation Retrieval task: given a parallel test set, we search for the target of each source sentence using cosine similarity.

Loading the model

Supported languages: **Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish**.

In [7]:
logging.basicConfig(format='%(asctime)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    level=logging.INFO,
                    handlers=[LoggingHandler()])

In [8]:
model_name = "distiluse-base-multilingual-cased"
inference_batch_size = 32

model = SentenceTransformer(model_name)

### Test Data (Quality Inspection)

The test set is parallel data from Apple, produced in the Quality Inspection stage carried out by DefinedCrowd. So far, we have parallel data for the following language pairs: **English-Arabic (en-ar), English-Italian (en-it), English-Japanese (en-jp), English-Korean (en-ko), English-Russian (en-ru) and Russian-French (ru-fr)**.

In [9]:
def get_data(language_pair):
    """Load source, target and gold (post-edited) sentences, one sentence per line."""
    path = os.path.join("..", "data", "processed", language_pair)

    def read_sentences(filename):
        # Plain line reading: recent pandas versions reject "\n" as a read_csv delimiter.
        # Blank lines are skipped, as pandas did by default.
        with open(os.path.join(path, filename), encoding='utf-8') as f:
            return [line.rstrip("\n") for line in f if line.strip()]

    src_sentences = read_sentences('source.txt')
    trg_sentences = read_sentences('target.txt')
    gold_sentences = read_sentences('gold.txt')

    return src_sentences, trg_sentences, gold_sentences

For each sentence pair, we check whether the target is the nearest neighbour of the source in the embedding space. To do this, for each src_sentences[i] we check whether trg_sentences[i] has the highest cosine similarity out of all target sentences. If this is the case, we have a hit, otherwise an error. This evaluator reports accuracy (higher = better).

In [10]:
src_sentences, trg_sentences, gold_sentences = get_data("en-ko")
logging.info(str(len(src_sentences))+" sentence pairs")
dev_trans_acc = evaluation.TranslationEvaluator(src_sentences, trg_sentences, batch_size=inference_batch_size, print_wrong_matches=True)
dev_trans_acc(model)

FileNotFoundError: [Errno 2] No such file or directory: '../data/processed/en-ko/source.txt'

<img src='images/translation-retriveal-evaluation.png' width='700'>

### Semantic similarity score

Another way to assess the quality of our model is to check whether the cosine similarity between the source and the target is no higher than the cosine similarity between the source and the gold target (post-edited by humans).

Convert the source, target and gold sentences to embeddings and calculate the cosine similarity for each pair.

In [22]:
scores = []
scores_gold = []
src_sentences, trg_sentences, gold_sentences = get_data('en-it')
for src_sentence, trg_sentence, gold_sentence in zip(src_sentences, trg_sentences, gold_sentences):
    source_embedding = model.encode(src_sentence, convert_to_tensor=True)
    target_embedding = model.encode(trg_sentence, convert_to_tensor=True)
    gold_embedding = model.encode(gold_sentence, convert_to_tensor=True)
    scores.append(util.pytorch_cos_sim(source_embedding, target_embedding)[0])
    scores_gold.append(util.pytorch_cos_sim(source_embedding, gold_embedding)[0])

For each source and target pair, we check whether its cosine similarity is no higher than that of the source and the gold target (post-edited by humans), that is: cos_sim(src_sentences[i], trg_sentences[i]) <= cos_sim(src_sentences[i], gold_sentences[i]). If this is the case, we have a hit, otherwise an error. We report accuracy (higher = better).

In [23]:
results = zip(range(len(scores)), scores, scores_gold)
results = sorted(results, key=lambda x: x[1], reverse=False)

correct_score = 0
for idx, score, score_gold in results:
    if score_gold >= score:
        correct_score += 1

    print('Source: ', src_sentences[idx])
    print('Target: ', trg_sentences[idx])
    print('Gold: ', gold_sentences[idx])
    print("(Score: %.4f)" % (score))
    print("(Score Gold: %.4f)" % (score_gold))
    print("*******************************************************************************************")

acc_score = correct_score / len(results)
logging.info(str(len(results)) + " sentence pairs")
logging.info("Accuracy: {:.2f}".format(acc_score * 100))

Source:  Do you want to have a desert
Target:  Gradiresti un dessert?
Gold:  Vuoi avere un deserto?
(Score: 0.3461)
(Score Gold: 0.8793)
*******************************************************************************************
Source:  Source Text
Target:  Target Text
Gold:  Suggested Target Text
(Score: 0.534

<img src='images/semantic-similarity-target-vs-post-editing.png' width='700'>

### Test Data (Quality Validation)

The Quality Validation is done on the platform, where the crowd member needs to answer three questions:

* Is the source text intelligible?
* Is the meaning of the source conveyed?
* Is the translation fluent, and does it sound natural in the target language?

## Readings

I suggest the following complementary readings on Quality Estimation:

* A really good [book](https://www.morganclaypool.com/doi/abs/10.2200/S00854ED1V01Y201805HLT039) on Quality Estimation by Lucia Specia et al.
* [Predictor-Estimator architecture paper](https://www.aclweb.org/anthology/W17-4763.pdf)
* The [main conference on Machine Translation](http://www.statmt.org/wmt20/) and the [Quality Estimation](http://www.statmt.org/wmt20/quality-estimation-task.html) shared task.

## References

[1] [Quality Estimation for Machine Translation](https://www.morganclaypool.com/doi/abs/10.2200/S00854ED1V01Y201805HLT039) 

[2] [Unsupervised Quality Estimation for Neural Machine Translation](https://arxiv.org/abs/2005.10608)

[3] [Unbabel's Participation in the WMT19 Translation Quality Estimation Shared Task](https://arxiv.org/abs/1907.10352)

[4] [OpenKiwi: An Open Source Framework for Quality Estimation](https://arxiv.org/abs/1902.08646)

[5] [Quality In, Quality Out: Learning from Actual Mistakes](https://fredblain.org/transfer-learning-qe.html)

[6] [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/pdf/2004.09813.pdf)

[7] [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084)

[8] [Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond](https://arxiv.org/pdf/1812.10464.pdf)

[9] [BERTScore: Evaluating Text Generation with BERT](https://arxiv.org/pdf/1904.09675.pdf)