This is a PyTorch implementation for the EMNLP 2022 main conference paper Low-resource Neural Machine Translation with Cross-modal Alignment.
- Clone this repository:
git clone https://github.com/ictnlp/LNMT-CA
cd LNMT-CA/
- Please make sure you have installed PyTorch, and then install fairseq and other packages as follows:
pip install --editable ./
python3 setup.py install --user
python3 setup.py build_ext --inplace
- First make a directory to store the dataset:
SRC1=de
SRC2=fr
SRC3=cs
IMG_ROOT=data/img/
cd $IMG_ROOT
mkdir -p raw_images
mkdir $SRC1 $SRC2 $SRC3
- Download the dataset
We have provided the image lists, raw data, and BPE-processed data. We combined the three languages into one source text file and named the direction "mul-en". In the source text, each sentence carries a language identifier such as "[DE]", "[FR]", or "[CS]".
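The tagging convention above can be sketched in a few lines of Python. The `[DE]`/`[FR]`/`[CS]` tag strings follow the description in this README; the function and sample sentences are illustrative, not part of the repo's actual preprocessing scripts:

```python
# Sketch: prefix every source sentence with its language identifier,
# then concatenate the three languages into one "mul" source side.
def tag_sentences(sentences, lang):
    """Prefix each sentence with its language tag, e.g. '[DE] ...'."""
    tag = f"[{lang.upper()}]"
    return [f"{tag} {s.strip()}" for s in sentences]

de = tag_sentences(["ein Hund läuft ."], "de")
fr = tag_sentences(["un chien court ."], "fr")
cs = tag_sentences(["pes běží ."], "cs")

# The combined source file ("train.mul") is simply the concatenation.
train_mul = de + fr + cs
```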
The Multi30K images can be downloaded here, the COCO images can be downloaded here (choose the 2014 train, val, and test images), and the VizWiz images can be downloaded here.
cd raw_images
# Download the image data here
- Extract the image feature
Return to the "LNMT-CA/" directory:
cd CLIP
python extract_vit.py --lang $SRC1 --device cuda:0
python extract_vit.py --lang $SRC2 --device cuda:0
python extract_vit.py --lang $SRC3 --device cuda:0
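The script above writes one feature file per image plus an averaged file, following the naming shown in the directory layout below. A minimal numpy sketch of that output convention; the 512-dimensional feature size and the averaging over the image set are assumptions, not read from `extract_vit.py`:

```python
import numpy as np

# Sketch of the files extract_vit.py is expected to produce for one
# language. File names follow the layout in this README; the feature
# dimension (512, typical for CLIP ViT-B) is an assumption.
lang = "de"
features = [np.random.rand(512).astype(np.float32) for _ in range(3)]

# One .npy per image: de_vit_clip_0.npy, de_vit_clip_1.npy, ...
per_image = {f"{lang}_vit_clip_{i}.npy": feat
             for i, feat in enumerate(features)}

# Plus one averaged feature file: de_vit_clip_avg.npy
# (assumed here to be the mean over the per-image features).
avg = np.mean(np.stack(features), axis=0)
per_image[f"{lang}_vit_clip_avg.npy"] = avg
```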
- Finally, the directory "data" should look like this:
.
├── text
│   ├── raw
│   │   ├── train.de
│   │   ├── train.fr
│   │   ├── ......
│   └── bpe
│       ├── train.mul
│       ├── train.en
│       ├── ......
└── img
    ├── raw_images
    ├── de_img_name.txt
    ├── fr_img_name.txt
    ├── cs_img_name.txt
    ├── de
    │   ├── de_vit_clip_avg.npy
    │   ├── de_vit_clip_0.npy
    │   ├── ......
    ├── fr
    └── cs
- Use the "fairseq-preprocess" command to convert the BPE texts into fairseq format:
TEXT=data/text/bpe/en-mul
fairseq-preprocess --source-lang mul --target-lang en --trainpref ${TEXT}/train --validpref ${TEXT}/val --testpref ${TEXT}/test_2016.de-en,${TEXT}/test_2016.fr-en,${TEXT}/test_2016.cs-en,${TEXT}/test_2017.de-en,${TEXT}/test_2017.fr-en,${TEXT}/test_mscoco.de-en,${TEXT}/test_mscoco.fr-en --destdir data-bin/multilingual-60k --joined-dictionary --workers=20
- Train the model with sentence-level contrastive learning loss for 40-50 epochs:
exp=multilingual-60k
fairseq-train data-bin/multilingual-60k --task translation --source-lang mul --target-lang en --arch transformer --dropout 0.3 --share-all-embeddings \
--image_root data/img/ --de_sen 60136 --fr_sen 60136 --cs_sen 60136 \
--sen_tem 0.007 --token_tem 0.1 \
--sentence_level True --token_level False \
--encoder-layers 6 --decoder-layers 6 \
--encoder-embed-dim 512 --decoder-embed-dim 512 \
--encoder-ffn-embed-dim 1024 --decoder-ffn-embed-dim 1024 \
--encoder-attention-heads 4 --decoder-attention-heads 4 \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 2000 \
--lr 0.005 \
--criterion label_smoothed_cross_entropy_contrastive --label-smoothing 0.1 --weight-decay 0.0 \
--max-tokens 4096 \
--update-freq 4 --no-progress-bar --log-format json --log-interval 100 \
--keep-last-epochs 10 \
--save-dir data/checkpoints/$exp \
--ddp-backend=no_c10d \
--patience 10 \
--left-pad-source False | tee experiment/logs/$exp.txt
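The sentence-level objective controlled by `--sen_tem` is a temperature-scaled contrastive loss between sentence and image representations. A minimal numpy sketch of an InfoNCE-style loss of this kind; the exact formulation inside `label_smoothed_cross_entropy_contrastive` may differ:

```python
import numpy as np

def sentence_contrastive_loss(text_feats, image_feats, temperature=0.007):
    """InfoNCE-style sketch: each sentence should be most similar to its
    paired image among all images in the batch. Illustrates the role of
    --sen_tem; not the repo's exact implementation."""
    # L2-normalize so dot products become cosine similarities.
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    v = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    logits = t @ v.T / temperature                 # (batch, batch) similarities
    # Cross-entropy with the diagonal (the paired image) as the target.
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
loss = sentence_contrastive_loss(rng.normal(size=(4, 512)),
                                 rng.normal(size=(4, 512)))
```

A small temperature such as 0.007 sharpens the softmax, so the model is penalized heavily unless the paired image clearly dominates the similarity row.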
- Train the model with both the sentence-level and the token-level contrastive learning losses for up to 60-70 epochs:
fairseq-train data-bin/multilingual-60k --task translation --source-lang mul --target-lang en --arch transformer --dropout 0.3 --share-all-embeddings \
--image_root data/img/ --de_sen 60136 --fr_sen 60136 --cs_sen 60136 \
--sen_tem 0.007 --token_tem 0.1 \
--sentence_level True --token_level True \
--encoder-layers 6 --decoder-layers 6 \
--encoder-embed-dim 512 --decoder-embed-dim 512 \
--encoder-ffn-embed-dim 1024 --decoder-ffn-embed-dim 1024 \
--encoder-attention-heads 4 --decoder-attention-heads 4 \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 2000 \
--lr 0.005 \
--criterion label_smoothed_cross_entropy_contrastive --label-smoothing 0.1 --weight-decay 0.0 \
--max-tokens 4096 \
--update-freq 4 --no-progress-bar --log-format json --log-interval 100 \
--keep-last-epochs 10 \
--save-dir experiment/checkpoints/$exp \
--ddp-backend=no_c10d \
--patience 10 \
--left-pad-source False | tee experiment/logs/$exp.txt
- If you want to train the model with your own data, please remember to change "de_sen", "fr_sen", and "cs_sen", which specify the number of sentences for each language.
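The per-language sentence counts can be obtained by counting non-empty lines in the raw training files. A short sketch; the path shown is illustrative and should point at wherever your own training files live:

```python
# Sketch: compute the values for --de_sen/--fr_sen/--cs_sen from the
# raw training files (layout as in the "data/text/raw" tree above).
def count_sentences(path):
    """Count non-empty lines in a text file, one sentence per line."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())

# Example (hypothetical path):
# de_sen = count_sentences("data/text/raw/train.de")
```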
- Run the following script to average the last 5 checkpoints and evaluate on the three Multi30K test sets. Since there are three source languages, the [DE] test sets are test, test3, and test5; the [FR] test sets are test1, test4, and test6; the [CS] test set is test2:
mkdir -p results
MODEL=multilingual-60k
DATASET=multilingual-60k
sh test_avg.sh $MODEL $DATASET 5
The results will be stored in "LNMT-CA/results/".
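Checkpoint averaging (fairseq ships `scripts/average_checkpoints.py` for this) is a plain elementwise mean of the parameter tensors across the saved epochs. A toy numpy sketch of the arithmetic behind `test_avg.sh`:

```python
import numpy as np

def average_checkpoints(state_dicts):
    """Elementwise mean of each parameter tensor across checkpoints -
    the operation behind averaging the last N epochs before decoding."""
    keys = state_dicts[0].keys()
    return {k: np.mean([sd[k] for sd in state_dicts], axis=0) for k in keys}

# Toy example: two "checkpoints" holding one parameter each.
ckpts = [{"w": np.array([1.0, 2.0])},
         {"w": np.array([3.0, 4.0])}]
avg = average_checkpoints(ckpts)
```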
If this repository is useful for you, please cite it as:
@inproceedings{yang-etal-2022-low,
    title = "Low-resource Neural Machine Translation with Cross-modal Alignment",
    author = "Yang, Zhe and Fang, Qingkai and Feng, Yang",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
}
If you have any questions, feel free to contact me at yangzhe22s1@ict.ac.cn.