InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions

This repository includes the InstructDoc dataset introduced by the following paper: Ryota Tanaka, Taichi Iki, Kyosuke Nishida, Kuniko Saito, and Jun Suzuki. "InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions". In Proc. of AAAI. 2024.

We introduce InstructDoc, the first large-scale visual instruction tuning dataset that covers a wide range of VDU tasks and datasets.

Get Started

1. Download datasets

sh download.sh

This script helps you to download most of the datasets automatically. For some datasets, due to the license issue and downloading restrictions, you need to manually download them by following the instructions in download_scripts/README.md

2. Preprocess datasets

sh process_data.sh API_KEY

This script helps you to process all the datasets. To extract OCR information from document images, we used Google Vision API and set the variables "API_KEY" to the API key obtained from Google Cloud Platform. To get one visit the link.

If you encounter the FileNotFoundError while processing the datasets, please set the variable --input_data_dir in data_processors to your dataset directory name correctly.

3. Merge preprocessed datasets

python merge_datasets --max_samples 5000 --input_data_dir processed_data --save_dir ./

We randomly sampled a maximum of 5000 instances for each held-in dataset. After processing datasets, you can obtain JSON files with the following format. If the dataset provides multiple images per instance (e.g., SlideVQA), we add "_list" into the fields, including "image", "ocr", and "bboxes".

   {
      "dataset_name": dataset name,
      "id": identification of the instance,
      "image" or "image_list": image path,
      "ocr" or "ocr_list": ocr text,
      "bboxes" or "bboxes_list": [x1, y1, x2, y2, w, h],
      "conversations": [
        {'user': 'human', 'value': randomly sampled instruction}
        {'user': 'gpt', 'value': answer}
      ]
    }

Citation

You can cite it as follows:

@inproceedings{InstructDoc2024,
  author    = {Ryota Tanaka and
               Taichi Iki and
               Kyosuke Nishida and
               Kuniko Saito and
               Jun Suzuki},
  title     = {InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions},
  booktitle = {AAAI},
  year      = {2024}
}

If you have any questions about the paper and repository, feel free to contact Ryota Tanaka (ryota.tanaka[at]ntt.com) or open an issue!

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
data_preprocessors		data_preprocessors
download_scripts		download_scripts
LICENSE		LICENSE
README.md		README.md
download.sh		download.sh
example.png		example.png
instructdoc_instructions.xlsx		instructdoc_instructions.xlsx
merge_datasets.py		merge_datasets.py
process_data.sh		process_data.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

data_preprocessors

data_preprocessors

download_scripts

download_scripts

LICENSE

LICENSE

README.md

README.md

download.sh

download.sh

example.png

example.png

instructdoc_instructions.xlsx

instructdoc_instructions.xlsx

merge_datasets.py

merge_datasets.py

process_data.sh

process_data.sh

Repository files navigation

InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions

Get Started

1. Download datasets

2. Preprocess datasets

3. Merge preprocessed datasets

Citation

About

Releases

Packages

Languages

License

nttmdlab-nlp/InstructDoc

Folders and files

Latest commit

History

Repository files navigation

InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions

Get Started

1. Download datasets

2. Preprocess datasets

3. Merge preprocessed datasets

Citation

About

Resources

License

Stars

Watchers

Forks

Languages