Document Image Machine Translation with Dynamic Multi-pre-trained Models Assembling

This is the official repository for the DIMTDA framework and the DoTA dataset introduced in the following paper: Document Image Machine Translation with Dynamic Multi-pre-trained Models Assembling (NAACL 2024 Main)

📜 Abstract

Text image machine translation (TIMT) is a task that translates source texts embedded in the image to target translations. The existing TIMT task mainly focuses on text-line-level images. In this paper, we extend the current TIMT task and propose a novel task, Document Image Machine Translation to Markdown (DIMT2Markdown), which aims to translate a source document image with long context and complex layout structure to markdown-formatted target translation. We also introduce a novel framework, Document Image Machine Translation with Dynamic multi-pre-trained models Assembling (DIMTDA). A dynamic model assembler is used to integrate multiple pre-trained models to enhance the model’s understanding of layout and translation capabilities. Moreover, we build a novel large-scale Document image machine Translation dataset of ArXiv articles in markdown format (DoTA), containing 126K image-translation pairs. Extensive experiments demonstrate the feasibility of end-to-end translation of rich-text document images and the effectiveness of DIMTDA.

The diagram of the proposed DIMTDA.

The output samples of DIMTDA. (a) and (c) are the original document images. (b) and (d) are the output translated texts in markdown format after rendering.

🗂️ DoTA dataset

In addition to the 126K filtered samples used in the paper, we provide all 139K unfiltered samples. Each sample contains the original English document image, the transcribed English mmd file, and the translated Chinese/French/German mmd files. The samples used in the paper are listed in a JSON file.

The DoTA dataset can be downloaded from this Hugging Face link. After submitting the download application on Hugging Face, please send an email to liangyupu2021@ia.ac.cn with your name and affiliated institution.
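As a rough sketch of how the released files might be consumed, the snippet below pairs each document image with its English transcription and Chinese translation. The directory names and file suffixes are assumptions for illustration, not the dataset's documented layout; check the dataset card on Hugging Face for the real structure.

```python
# Minimal sketch for iterating DoTA samples.
# NOTE: the directory names and file suffixes below are assumptions for
# illustration; adjust them to the actual layout of the released dataset.
from pathlib import Path
from PIL import Image

DATA_ROOT = Path("DoTA_dataset")  # hypothetical local path

def iter_samples(image_dir="images", en_dir="en_mmd", zh_dir="zh_mmd"):
    for image_path in sorted((DATA_ROOT / image_dir).glob("*.png")):
        sample_id = image_path.stem
        en_mmd = (DATA_ROOT / en_dir / f"{sample_id}.mmd").read_text(encoding="utf-8")
        zh_mmd = (DATA_ROOT / zh_dir / f"{sample_id}.mmd").read_text(encoding="utf-8")
        yield Image.open(image_path), en_mmd, zh_mmd
```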

🛠️ DIMTDA

1. Requirements

python==3.10.13
pytorch==1.13.1
transformers==4.33.2
sacrebleu==2.3.1
jieba==0.42.1
zss==1.2.0

2. Download pre-trained models

Download the pre-trained DiT model from microsoft/dit-base.

Download the pre-trained Nougat model from facebook/nougat-small. (A programmatic download sketch follows the directory tree below.)

The file directory structure is as follows:

DIMTDA
├── codes
├── DoTA_dataset
├── pretrained_models
└── utils
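If you prefer to fetch the checkpoints programmatically rather than by hand, a minimal sketch using huggingface_hub is shown below. The local sub-directory names under pretrained_models/ are assumptions; adjust them to whatever paths the training scripts expect.

```python
# Sketch: download the two pre-trained checkpoints into pretrained_models/.
# NOTE: the local sub-directory names are assumptions, not mandated by the repo.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="microsoft/dit-base", local_dir="pretrained_models/dit-base")
snapshot_download(repo_id="facebook/nougat-small", local_dir="pretrained_models/nougat-small")
```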

3. Pre-train a text translation model

bash pretrain_trans.sh

4. Finetune DIMTDA

bash finetune_dimtda.sh

5. Inference

Before running the script, you need to replace the ~/anaconda3/envs/your_env_name/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py file with the ./utils/modeling_bert.py file.
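One way to perform this replacement without hard-coding the conda path is to resolve the installed file from the imported module and back it up before overwriting. This is a sketch, not part of the official scripts.

```python
# Sketch: overwrite the installed modeling_bert.py with the patched version.
# Resolving the path from the imported module avoids hard-coding the conda env path.
import shutil
import transformers.models.bert.modeling_bert as modeling_bert

installed_path = modeling_bert.__file__
shutil.copy(installed_path, installed_path + ".bak")     # keep a backup of the original
shutil.copy("./utils/modeling_bert.py", installed_path)  # install the patched file
print(f"Replaced {installed_path}")
```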

bash inference.sh

6. Evaluate

bash evaluate.sh
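evaluate.sh wraps the repository's own metric scripts. As a rough illustration of the kind of scoring involved, document-level BLEU on Chinese output can be computed with sacrebleu's zh tokenizer (listed in the requirements). The file names below are placeholders, not outputs of the repo's scripts.

```python
# Sketch: corpus-level BLEU for Chinese hypotheses with sacrebleu.
# NOTE: hyp.txt / ref.txt are placeholder file names for illustration only.
import sacrebleu

with open("hyp.txt", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("ref.txt", encoding="utf-8") as f:
    references = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references], tokenize="zh")
print(bleu.score)
```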

🖻 More samples

The output samples of DIMTDA. For each image pair, the left one is the input document image, and the right one is the output translation in markdown format after rendering.

🙏🏻 Acknowledgement

We thank @lukas-blecher and the facebookresearch/nougat project for providing the dataset construction method and pre-trained model. We also thank the microsoft/unilm project for providing the pre-trained model.

✍🏻 Citation

If you want to cite our paper, please use the following BibTeX entry:

@inproceedings{liang-etal-2024-document,
    title = "Document Image Machine Translation with Dynamic Multi-pre-trained Models Assembling",
    author = "Liang, Yupu  and
      Zhang, Yaping  and
      Ma, Cong  and
      Zhang, Zhiyang  and
      Zhao, Yang  and
      Xiang, Lu  and
      Zong, Chengqing  and
      Zhou, Yu",
    editor = "Duh, Kevin  and
      Gomez, Helena  and
      Bethard, Steven",
    booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.naacl-long.392",
    pages = "7077--7088",
}

If you have any questions, feel free to contact liangyupu2021@ia.ac.cn.
