This is a PyTorch implementation for the EMNLP 2022 main conference paper Low-resource Neural Machine Translation with Cross-modal Alignment.
- Clone this repository:
git clone https://github.com/ictnlp/LNMT-CA
cd LNMT-CA/
- Please make sure you have installed PyTorch, and then install fairseq and other packages as follows:
pip install --editable ./
python3 setup.py install --user
python3 setup.py build_ext --inplace
- First make a directory to store the dataset:
SRC1=de
SRC2=fr
SRC3=cs
IMG_ROOT=data/img/
cd $IMG_ROOT
mkdir -p raw_images
mkdir $SRC1 $SRC2 $SRC3
- Download the dataset
We have provided the image lists, raw data, and BPE-processed data. We combined the three languages into one source text file and named the direction "mul-en". In the source text, each sentence carries a language identifier such as "[DE]", "[FR]", or "[CS]".
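The tagging convention above can be sketched in a few lines of Python. The `[DE]`/`[FR]`/`[CS]` tag strings follow the description in this README; the function and sample sentences are illustrative, not part of the repo's actual preprocessing scripts:

```python
# Sketch: prefix every source sentence with its language identifier,
# then concatenate the three languages into one "mul" source side.
def tag_sentences(sentences, lang):
    """Prefix each sentence with its language tag, e.g. '[DE] ...'."""
    tag = f"[{lang.upper()}]"
    return [f"{tag} {s.strip()}" for s in sentences]

de = tag_sentences(["ein Hund läuft ."], "de")
fr = tag_sentences(["un chien court ."], "fr")
cs = tag_sentences(["pes běží ."], "cs")

# The combined source file ("train.mul") is simply the concatenation.
train_mul = de + fr + cs
```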
The Multi30K images can be downloaded here, the COCO images can be downloaded here (choose the 2014 train, val, and test images), and the VizWiz images can be downloaded here.
cd raw_images
# Download the image data here
- Extract the image feature
Return to the "LNMT-CA/" directory:
cd CLIP
python extract_vit.py --lang $SRC1 --device cuda:0
python extract_vit.py --lang $SRC2 --device cuda:0
python extract_vit.py --lang $SRC3 --device cuda:0
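The script above writes one feature file per image plus an averaged file, following the naming shown in the directory layout below. A minimal numpy sketch of that output convention; the 512-dimensional feature size and the averaging over the image set are assumptions, not read from `extract_vit.py`:

```python
import numpy as np

# Sketch of the files extract_vit.py is expected to produce for one
# language. File names follow the layout in this README; the feature
# dimension (512, typical for CLIP ViT-B) is an assumption.
lang = "de"
features = [np.random.rand(512).astype(np.float32) for _ in range(3)]

# One .npy per image: de_vit_clip_0.npy, de_vit_clip_1.npy, ...
per_image = {f"{lang}_vit_clip_{i}.npy": feat
             for i, feat in enumerate(features)}

# Plus one averaged feature file: de_vit_clip_avg.npy
# (assumed here to be the mean over the per-image features).
avg = np.mean(np.stack(features), axis=0)
per_image[f"{lang}_vit_clip_avg.npy"] = avg
```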
- Finally, the directory "data" should look like this:
.
├── text
│   ├── raw
│   │   ├── train.de
│   │   ├── train.fr
│   │   ├── ......
│   └── bpe
│       ├── train.mul
│       ├── train.en
│       ├── ......
└── img
    ├── raw_images
    ├── de_img_name.txt
    ├── fr_img_name.txt
    ├── cs_img_name.txt
    ├── de
    │   ├── de_vit_clip_avg.npy
    │   ├── de_vit_clip_0.npy
    │   ├── ......
    ├── fr
    └── cs
- Use the "fairseq-preprocess" command to convert the BPE texts into fairseq format:
TEXT=data/text/bpe/en-mul
fairseq-preprocess --source-lang mul --target-lang en --trainpref ${TEXT}/train --validpref ${TEXT}/val --testpref ${TEXT}/test_2016.de-en,${TEXT}/test_2016.fr-en,${TEXT}/test_2016.cs-en,${TEXT}/test_2017.de-en,${TEXT}/test_2017.fr-en,${TEXT}/test_mscoco.de-en,${TEXT}/test_mscoco.fr-en --destdir data-bin/multilingual-60k --joined-dictionary --workers=20
- Train the model with sentence-level contrastive learning loss for 40-50 epochs:
exp=multilingual-60k
fairseq-train data-bin/multilingual-60k --task translation --source-lang mul --target-lang en --arch transformer --dropout 0.3 --share-all-embeddings \
--image_root data/img/ --de_sen 60136 --fr_sen 60136 --cs_sen 60136 \
--sen_tem 0.007 --token_tem 0.1 \
--sentence_level True --token_level False \
--encoder-layers 6 --decoder-layers 6 \
--encoder-embed-dim 512 --decoder-embed-dim 512 \
--encoder-ffn-embed-dim 1024 --decoder-ffn-embed-dim 1024 \
--encoder-attention-heads 4 --decoder-attention-heads 4 \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 2000 \
--lr 0.005 \
--criterion label_smoothed_cross_entropy_contrastive --label-smoothing 0.1 --weight-decay 0.0 \
--max-tokens 4096 \
--update-freq 4 --no-progress-bar --log-format json --log-interval 100 \
--keep-last-epochs 10 \
--save-dir data/checkpoints/$exp \
--ddp-backend=no_c10d \
--patience 10 \
--left-pad-source False | tee experiment/logs/$exp.txt
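The sentence-level objective controlled by `--sen_tem` is a temperature-scaled contrastive loss between sentence and image representations. A minimal numpy sketch of an InfoNCE-style loss of this kind; the exact formulation inside `label_smoothed_cross_entropy_contrastive` may differ:

```python
import numpy as np

def sentence_contrastive_loss(text_feats, image_feats, temperature=0.007):
    """InfoNCE-style sketch: each sentence should be most similar to its
    paired image among all images in the batch. Illustrates the role of
    --sen_tem; not the repo's exact implementation."""
    # L2-normalize so dot products become cosine similarities.
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    v = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    logits = t @ v.T / temperature                 # (batch, batch) similarities
    # Cross-entropy with the diagonal (the paired image) as the target.
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
loss = sentence_contrastive_loss(rng.normal(size=(4, 512)),
                                 rng.normal(size=(4, 512)))
```

A small temperature such as 0.007 sharpens the softmax, so the model is penalized heavily unless the paired image clearly dominates the similarity row.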
- Train the model with both the sentence-level and the token-level contrastive learning losses for up to 60-70 epochs:
fairseq-train data-bin/multilingual-60k --task translation --source-lang mul --target-lang en --arch transformer --dropout 0.3 --share-all-embeddings \
--image_root data/img/ --de_sen 60136 --fr_sen 60136 --cs_sen 60136 \
--sen_tem 0.007 --token_tem 0.1 \
--sentence_level True --token_level True \
--encoder-layers 6 --decoder-layers 6 \
--encoder-embed-dim 512 --decoder-embed-dim 512 \
--encoder-ffn-embed-dim 1024 --decoder-ffn-embed-dim 1024 \
--encoder-attention-heads 4 --decoder-attention-heads 4 \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 2000 \
--lr 0.005 \
--criterion label_smoothed_cross_entropy_contrastive --label-smoothing 0.1 --weight-decay 0.0 \
--max-tokens 4096 \
--update-freq 4 --no-progress-bar --log-format json --log-interval 100 \
--keep-last-epochs 10 \
--save-dir experiment/checkpoints/$exp \
--ddp-backend=no_c10d \
--patience 10 \
--left-pad-source False | tee experiment/logs/$exp.txt
- If you want to train the model with your own data, please remember to change "de_sen", "fr_sen", and "cs_sen", which specify the number of sentences for each language.
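The per-language sentence counts can be obtained by counting non-empty lines in the raw training files. A short sketch; the path shown is illustrative and should point at wherever your own training files live:

```python
# Sketch: compute the values for --de_sen/--fr_sen/--cs_sen from the
# raw training files (layout as in the "data/text/raw" tree above).
def count_sentences(path):
    """Count non-empty lines in a text file, one sentence per line."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())

# Example (hypothetical path):
# de_sen = count_sentences("data/text/raw/train.de")
```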
- Run the following script to average the last 5 checkpoints and evaluate on the three Multi30K test sets. Since there are three source languages, the [DE] test sets are test, test3, and test5; the [FR] test sets are test1, test4, and test6; the [CS] test set is test2:
mkdir -p results
MODEL=multilingual-60k
DATASET=multilingual-60k
sh test_avg.sh $MODEL $DATASET 5
The results will be stored in "LNMT-CA/results/".
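Checkpoint averaging (fairseq ships `scripts/average_checkpoints.py` for this) is a plain elementwise mean of the parameter tensors across the saved epochs. A toy numpy sketch of the arithmetic behind `test_avg.sh`:

```python
import numpy as np

def average_checkpoints(state_dicts):
    """Elementwise mean of each parameter tensor across checkpoints -
    the operation behind averaging the last N epochs before decoding."""
    keys = state_dicts[0].keys()
    return {k: np.mean([sd[k] for sd in state_dicts], axis=0) for k in keys}

# Toy example: two "checkpoints" holding one parameter each.
ckpts = [{"w": np.array([1.0, 2.0])},
         {"w": np.array([3.0, 4.0])}]
avg = average_checkpoints(ckpts)
```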
If this repository is useful for you, please cite it as:
@inproceedings{yang-etal-2022-low,
    title = "Low-resource Neural Machine Translation with Cross-modal Alignment",
    author = "Yang, Zhe and Fang, Qingkai and Feng, Yang",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
}
If you have any questions, feel free to contact me at yangzhe22s1@ict.ac.cn.