This repository contains the code for the EMNLP'23 submission "Bridging the Gap between Synthetic and Authentic Images for Multimodal Machine Translation".
Text-to-image generation environment:
conda create -n stable python==3.8
pip install torch==2.0.1
pip install Pillow==9.5.0
pip install transformers==4.27.4
pip install diffusers==0.16.1
pip install scipy==1.10.1
pip install accelerate==0.18.0
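Optionally, a quick import check (a minimal sketch, nothing repository-specific) verifies that the pinned versions resolve together before running the generation script:

```python
# Minimal sanity check that the pinned packages import together (versions only,
# no GPU needed); run it inside the "stable" environment.
import torch, transformers, diffusers, scipy, PIL
print(torch.__version__, transformers.__version__, diffusers.__version__,
      scipy.__version__, PIL.__version__)
```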
Training environment:
conda create -n sammt python==3.6.7
pip install -r requirements.txt
pip install --editable ./
The Multi30K texts and images can be downloaded here and here. We obtain the Multi30K text data from fairseq_mmt.
cd fairseq_sammt
git clone https://github.com/multi30k/dataset.git
git clone https://github.com/BryanPlummer/flickr30k_entities.git
# Organize the downloaded dataset
flickr30k
├─ flickr30k-images
├─ test_2017_flickr
└─ test_2017_mscoco
multi30k-dataset
└─ data
└─ task1
├─ tok
└─ image_splits
Generate synthetic images from the Multi30K English captions with Stable Diffusion (a sketch of this step follows the parameter list below):
conda activate stable
python train_stable_diffusion_step50.py train
script parameters:
- dataset: $1: choices=['train', 'valid', 'test', 'test1', 'test2']
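As an illustration, the generation step presumably reads the tokenized English captions of the chosen split and renders one image per caption. The sketch below is an assumption, not the repository's exact code: the checkpoint name, file paths, and output naming are guesses, and the 50 denoising steps are inferred from the script name.

```python
# Hypothetical sketch of train_stable_diffusion_step50.py; the checkpoint,
# paths, and output naming are assumptions, not the repository's exact choices.
import os
import sys
import torch
from diffusers import StableDiffusionPipeline

split = sys.argv[1]  # 'train' | 'valid' | 'test' | 'test1' | 'test2'
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Tokenized English captions and the matching image names for this split.
captions = open(f"multi30k-dataset/data/task1/tok/{split}.lc.norm.tok.en").read().splitlines()
names = open(f"multi30k-dataset/data/task1/image_splits/{split}.txt").read().splitlines()

os.makedirs(f"synthetic_images/{split}", exist_ok=True)
for caption, name in zip(captions, names):
    # 50 denoising steps, as the script name suggests.
    image = pipe(caption, num_inference_steps=50).images[0]
    image.save(f"synthetic_images/{split}/{name}")
```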
Extract CLIP image features from the synthetic or authentic images (a sketch of this step follows the parameter list below):
conda activate sammt
python image_process.py train synth
script parameters:
- dataset: $1: choices=['train', 'valid', 'test', 'test1', 'test2']
- synthetic or authentic images: $2: choices=['synth', 'authe']
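A minimal sketch of what the feature-extraction step likely does, assuming a ViT-B/32 CLIP checkpoint and the directory layout above (the repository's script defines the actual model, paths, and output format):

```python
# Hypothetical sketch of image_process.py: encode every image of a split with
# CLIP. Checkpoint, directory layout, and output format are assumptions.
import os
import sys
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

split, source = sys.argv[1], sys.argv[2]  # e.g. 'train', 'synth' or 'authe'
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image_dir = f"synthetic_images/{split}" if source == "synth" else "flickr30k/flickr30k-images"
names = open(f"multi30k-dataset/data/task1/image_splits/{split}.txt").read().splitlines()

features = []
with torch.no_grad():
    for name in names:
        inputs = processor(images=Image.open(f"{image_dir}/{name}"), return_tensors="pt")
        features.append(model.get_image_features(**inputs))

os.makedirs("features", exist_ok=True)
torch.save(torch.cat(features), f"features/{split}.{source}.clip.pt")
```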
The pre-extracted image features can also be downloaded here.
Preprocess the text data and train the multimodal machine translation model:
conda activate sammt
bash preprocess.sh
bash train_mmt.sh
Run inference with the trained model:
# bash translate_mmt.sh $1 $2 $3
bash translate_mmt.sh clip test synth
script parameters:
- image feature: $1: choices=['clip']
- test set: $2: choices=['test', 'test1', 'test2']
- inference with synthetic or authentic images: $3: choices=['synth', 'authe']
This project is built on several open-source repositories/codebases, including: