NMT Back Translation

A large amount of well-formed data is essential for improving deep learning model performance. However, collecting large amounts of data is difficult in many situations. Back translation is one methodology for overcoming this data scarcity and driving performance improvements. This repo covers several back translation methodologies and compares the performance of each. For an accurate comparison, all variables other than the back translation methodology are held fixed.



Methodologies

Amount of Synthetic Data Tuning

This axis varies how much synthetic data is mixed into the real parallel training data.
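A minimal sketch of the mixing step this tuning controls, assuming synthetic pairs have already been generated; the function name and the ratio values are illustrative, not the repo's actual interface:

```python
import random

def mix_datasets(real_pairs, synthetic_pairs, synthetic_ratio=1.0):
    """Mix sampled synthetic pairs into the real parallel data.

    synthetic_ratio is the number of synthetic pairs per real pair.
    """
    n_synthetic = min(int(len(real_pairs) * synthetic_ratio), len(synthetic_pairs))
    mixed = real_pairs + random.sample(synthetic_pairs, n_synthetic)
    random.shuffle(mixed)
    return mixed
```

Sweeping synthetic_ratio (e.g. 0.5, 1.0, 2.0) is presumably what this methodology compares.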


How to Generate Synthetic Data

For generating sentences, the typical options are greedy search, beam search, and top-p sampling. Greedy and beam search follow the maximum-probability tokens when generating a sequence, so they tend to produce more precise sentences. Top-p sampling, on the other hand, can generate a wider variety of sentences than the other two methods, but it is more likely to produce less polished output.

In this experiment, greedy search is used, considering the trade-off between generation speed and accuracy.
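Below is a minimal sketch of the three decoding options using the Hugging Face transformers generate API. The checkpoint name comes from the Experimental Setups section below; everything else is illustrative:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Ko-En checkpoint listed in the Experimental Setups below
model_name = "circulus/kobart-trans-ko-en-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("안녕하세요, 만나서 반갑습니다.", return_tensors="pt")

# Greedy search: always take the single most probable next token
greedy_ids = model.generate(**inputs, max_new_tokens=64)

# Beam search: track the 5 most probable partial hypotheses at each step
beam_ids = model.generate(**inputs, num_beams=5, max_new_tokens=64)

# Top-p (nucleus) sampling: sample from the smallest token set whose
# cumulative probability exceeds p (more diverse, less precise)
sampled_ids = model.generate(**inputs, do_sample=True, top_p=0.9, max_new_tokens=64)

for ids in (greedy_ids, beam_ids, sampled_ids):
    print(tokenizer.decode(ids[0], skip_special_tokens=True))
```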


Corrupting Synthetic Data

One of the keys to boosting performance through back translation is to induce a harder training problem by deliberately lowering the quality of the data. Just as denoising language models are trained to auto-encode intentionally corrupted sentences, we corrupt the synthetic data to improve translation model performance. Two corruption schemes are applied, sketched in the code after the list:

  • Masking
  • Random Token Deletion
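A minimal sketch of both corruptions at the token-ID level, assuming a tokenizer that exposes a mask token ID; the function names and corruption ratios are illustrative, not the repo's actual settings:

```python
import random

def mask_tokens(token_ids, mask_id, ratio=0.15):
    """Masking: replace a random subset of tokens with the mask token."""
    return [mask_id if random.random() < ratio else t for t in token_ids]

def delete_tokens(token_ids, ratio=0.1):
    """Random token deletion: drop a random subset, keeping at least one token."""
    kept = [t for t in token_ids if random.random() >= ratio]
    return kept if kept else token_ids[:1]

ids = [5, 128, 7, 42, 9, 301]       # illustrative token IDs
print(mask_tokens(ids, mask_id=4))  # e.g. [5, 4, 7, 42, 9, 301]
print(delete_tokens(ids))           # e.g. [5, 128, 42, 9, 301]
```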


Experimental Setups

The experiment was conducted on the Korean-English daily conversation dataset provided by AI Hub, using two pretrained KoBART translation models: circulus/kobart-trans-ko-en-v2 and circulus/kobart-trans-en-ko-v2.
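For illustration, here is a hedged sketch of one back-translation round with these checkpoints: the En-Ko model translates monolingual English into synthetic Korean, yielding extra (Ko, En) pairs for the Ko-En direction. The variable names and the tiny example batch are illustrative:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# The En-Ko model back-translates monolingual English into synthetic Korean,
# producing extra (Ko, En) pairs for training the Ko-En direction.
bt_name = "circulus/kobart-trans-en-ko-v2"
bt_tokenizer = AutoTokenizer.from_pretrained(bt_name)
bt_model = AutoModelForSeq2SeqLM.from_pretrained(bt_name).eval()

monolingual_en = ["How was your day?", "Let's have lunch together tomorrow."]

with torch.no_grad():
    batch = bt_tokenizer(monolingual_en, return_tensors="pt", padding=True)
    out_ids = bt_model.generate(**batch, max_new_tokens=64)  # greedy, per the section above

synthetic_ko = bt_tokenizer.batch_decode(out_ids, skip_special_tokens=True)
synthetic_pairs = list(zip(synthetic_ko, monolingual_en))  # (source, target) pairs
```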

| Dataset Desc | Model Config | Training Config |
|---|---|---|
| Data from: AI Hub | Input Dimension: 30,000 | Epochs: 10 |
| Specific Datasets: Dialogue, Daily | Output Dimension: 30,000 | Batch Size: 32 |
| Total Dataset Volume: 50,000 | Embedding Dimension: 256 | Learning Rate: 5e-4 |
| Train Dataset Volume: 48,000 | Hidden Dimension: 512 | iters_to_accumulate: 4 |
| Valid Dataset Volume: 1,000 | N Layers: 2 | Dropout Ratio: 0.5 | Gradient Clip Max Norm: 1 |
| Test Dataset Volume: 1,000 | Dropout Ratio: 0.5 | Apply AMP: True |
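A hedged sketch of how the iters_to_accumulate, gradient clipping, and AMP settings in the table above typically combine in a PyTorch training step. It assumes a Hugging Face-style model that returns a loss when labels are in the batch; this is not the repo's actual training loop:

```python
import torch

def train_epoch(model, train_loader, optimizer, scaler,
                iters_to_accumulate=4, max_norm=1.0):
    """One epoch with AMP and gradient accumulation, per the config table above."""
    model.train()
    optimizer.zero_grad()

    for step, batch in enumerate(train_loader):
        # Apply AMP: True -> run the forward pass in float16
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            # Divide so 4 accumulated micro-batches average like one big batch
            loss = model(**batch).loss / iters_to_accumulate
        scaler.scale(loss).backward()

        if (step + 1) % iters_to_accumulate == 0:
            scaler.unscale_(optimizer)  # so clipping sees unscaled gradients
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
```

To match the table, optimizer would be built with lr=5e-4 and scaler as torch.cuda.amp.GradScaler().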


Results

| Training Method | Lang Pair | Training Data Volume | Best Training Loss | BLEU Score |
|---|---|---|---|---|
| Vanilla | Ko-En | 48,000 | - | 34.41 |
| Vanilla | En-Ko | 48,000 | - | 11.71 |
| Back Translation | Ko-En | - | - | - |
| Back Translation | En-Ko | - | - | - |
| Back + Corruption | Ko-En | - | - | - |
| Back + Corruption | En-Ko | - | - | - |


