Document Image Machine Translation with Dynamic Multi-pre-trained Models Assembling

This is the official repository for the DIMTDA framework and the DoTA dataset, introduced in the paper Document Image Machine Translation with Dynamic Multi-pre-trained Models Assembling (NAACL 2024 Main).

📜 Abstract

Text image machine translation (TIMT) is a task that translates source texts embedded in the image to target translations. The existing TIMT task mainly focuses on text-line-level images. In this paper, we extend the current TIMT task and propose a novel task, Document Image Machine Translation to Markdown (DIMT2Markdown), which aims to translate a source document image with long context and complex layout structure to markdown-formatted target translation. We also introduce a novel framework, Document Image Machine Translation with Dynamic multi-pre-trained models Assembling (DIMTDA). A dynamic model assembler is used to integrate multiple pre-trained models to enhance the model’s understanding of layout and translation capabilities. Moreover, we build a novel large-scale Document image machine Translation dataset of ArXiv articles in markdown format (DoTA), containing 126K image-translation pairs. Extensive experiments demonstrate the feasibility of end-to-end translation of rich-text document images and the effectiveness of DIMTDA.

The diagram of the proposed DIMTDA.

The output samples of DIMTDA. (a) and (c) are the original document images. (b) and (d) are the output translated texts in markdown format after rendering.

🗂️ DoTA dataset

In addition to the 126K samples mentioned in the paper, we provide all 139K unfiltered samples. Each sample contains the original English image, the transcribed English mmd file, and the translated Chinese/French/German mmd files. The samples used in the paper are listed in a json file.
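As a rough illustration of how the json list could be used to recover the paper's subset from the full release, the sketch below assumes the file is a flat JSON array of sample identifiers. That layout and the function names `load_paper_ids` and `select_paper_samples` are hypothetical, so adapt them to the actual file.

```python
import json


def load_paper_ids(json_path):
    """Read the identifiers of the samples used in the paper.

    Assumption: the file holds a flat JSON array of strings.
    """
    with open(json_path, encoding="utf-8") as f:
        return set(json.load(f))


def select_paper_samples(all_sample_ids, paper_ids):
    """Keep only the samples that appear in the paper's list, preserving order."""
    return [sid for sid in all_sample_ids if sid in paper_ids]
```

Loading the identifiers into a set keeps the membership test cheap even for the full 139K-sample release.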

The DoTA dataset can be downloaded from this Hugging Face link. After submitting the download application on Hugging Face, please email liangyupu2021@ia.ac.cn with your name and affiliated institution.

🛠️ DIMTDA

1. Requirements

python==3.10.13
pytorch==1.13.1
transformers==4.33.2
sacrebleu==2.3.1
jieba==0.42.1
zss==1.2.0
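To confirm that an environment matches these pins before training, a small check along the following lines may help. It is only a sketch: the helper names are made up for illustration, the Python version itself is not checked, and PyTorch is looked up as `torch` because that is its distribution name under pip.

```python
from importlib import metadata

# Pins from the list above. Under pip, PyTorch is distributed as "torch".
PINNED = {
    "torch": "1.13.1",
    "transformers": "4.33.2",
    "sacrebleu": "2.3.1",
    "jieba": "0.42.1",
    "zss": "1.2.0",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package that deviates from its pin to (expected, found)."""
    mismatches = {}
    for name, expected in pinned.items():
        found = lookup(name)
        # Ignore local build suffixes such as "+cu117" when comparing.
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches
```

Calling `find_mismatches(PINNED)` returns an empty dict when every package matches; any entry it does return names the expected and the installed version.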

2. Download pre-trained models

Download pre-trained DiT model from microsoft/dit-base.

Download pre-trained Nougat model from facebook/nougat-small.

The file directory structure is as follows:

DIMTDA
├── codes
├── DoTA_dataset
├── pretrained_models
└── utils
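One possible way to fetch both checkpoints into `pretrained_models` is the `snapshot_download` function from the `huggingface_hub` package. The sketch below assumes each model goes into a subfolder named after its repository (for example `pretrained_models/dit-base`); the training scripts may expect a different layout, so check the paths they reference before relying on it.

```python
from pathlib import Path

# Hub repositories named in this section.
MODEL_REPOS = ["microsoft/dit-base", "facebook/nougat-small"]


def plan_downloads(root="pretrained_models"):
    """Map each Hub repo id to a local target directory.

    Assumption: one subfolder per model, named after the repo.
    """
    return {repo: Path(root) / repo.split("/")[-1] for repo in MODEL_REPOS}


def download_all(root="pretrained_models"):
    """Fetch every planned repo with huggingface_hub (needs network access)."""
    # Imported lazily so the planning step works without the package installed.
    from huggingface_hub import snapshot_download

    for repo, target in plan_downloads(root).items():
        target.mkdir(parents=True, exist_ok=True)
        snapshot_download(repo_id=repo, local_dir=str(target))
```

Running `download_all()` from the `DIMTDA` directory would place the files under `pretrained_models`.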

3. Pre-train a text translation model

bash pretrain_trans.sh

4. Finetune DIMTDA

bash finetune_dimtda.sh

5. Inference

Before running the script, you need to replace the ~/anaconda3/envs/your_env_name/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py file with the ./utils/modeling_bert.py file.
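Because this step overwrites a file inside the installed `transformers` package, it may be worth keeping a copy of the original. The sketch below is one way to do that: it locates `modeling_bert.py` in the active environment instead of hard-coding the anaconda path, and saves a backup next to it. The helper names and the `.orig` suffix are illustrative choices, not part of this repository.

```python
import importlib.util
import shutil
from pathlib import Path


def installed_bert_path():
    """Locate modeling_bert.py inside the active environment's transformers package."""
    spec = importlib.util.find_spec("transformers")
    if spec is None or spec.origin is None:
        raise RuntimeError("transformers is not installed in this environment")
    return Path(spec.origin).parent / "models" / "bert" / "modeling_bert.py"


def replace_with_backup(patched, target):
    """Copy `patched` over `target`, keeping the original as <target>.orig."""
    patched, target = Path(patched), Path(target)
    backup = target.with_name(target.name + ".orig")
    if not backup.exists():  # never overwrite an earlier backup
        shutil.copy2(target, backup)
    shutil.copy2(patched, target)
    return backup
```

From the repository root, `replace_with_backup("./utils/modeling_bert.py", installed_bert_path())` performs the replacement; copying the `.orig` file back restores the stock implementation.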

bash inference.sh

6. Evaluate

bash evaluate.sh

🖻 More samples

The output samples of DIMTDA. For each image pair, the left one is the input document image, and the right one is the output translations in markdown format after rendering.

🙏🏻 Acknowledgement

We thank @lukas-blecher and the facebookresearch/nougat project for providing the dataset construction method and a pre-trained model. We also thank the microsoft/unilm project for providing a pre-trained model.

✍🏻 Citation

If you want to cite our paper, please use the following BibTeX entry:

@inproceedings{liang-etal-2024-document,
    title = "Document Image Machine Translation with Dynamic Multi-pre-trained Models Assembling",
    author = "Liang, Yupu  and
      Zhang, Yaping  and
      Ma, Cong  and
      Zhang, Zhiyang  and
      Zhao, Yang  and
      Xiang, Lu  and
      Zong, Chengqing  and
      Zhou, Yu",
    editor = "Duh, Kevin  and
      Gomez, Helena  and
      Bethard, Steven",
    booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.naacl-long.392",
    pages = "7077--7088",
}

If you have any questions, feel free to contact liangyupu2021@ia.ac.cn.
