Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs
tgdoc-7b-finetune-224 (extraction code: gxqt)
tgdoc-13b-finetune-224 (extraction code: gxqt)
Follow the installation instructions of LLaVA:
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip # enable PEP 660 support
pip install -e .
pip install flash-attn --no-build-isolation
Key package versions:
deepspeed==0.9.5
peft==0.4.0
transformers==4.31.0
accelerate==0.21.0
bitsandbytes==0.41.0
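As an optional sanity check, you can verify that the pinned versions above are actually installed in your environment. The sketch below is not part of the original repo; it simply mirrors the version list using importlib.metadata:

```python
# Optional sanity check for the pinned package versions (not part of the repo;
# the pin list below just mirrors the versions stated above).
from importlib.metadata import version, PackageNotFoundError

PINS = {
    "deepspeed": "0.9.5",
    "peft": "0.4.0",
    "transformers": "4.31.0",
    "accelerate": "0.21.0",
    "bitsandbytes": "0.41.0",
}

for pkg, expected in PINS.items():
    try:
        installed = version(pkg)
    except PackageNotFoundError:
        print(f"{pkg}: NOT INSTALLED (expected {expected})")
        continue
    status = "ok" if installed == expected else f"MISMATCH (expected {expected})"
    print(f"{pkg}: {installed} {status}")
```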
Note: If you train on RTX 3090 machines, you need to enable DeepSpeed and keep the CUDA version at 11.7; otherwise the loss will not decrease during training. This issue does not occur on A100 GPUs when DeepSpeed is not used.
Download the LLaVA and LLaVAR datasets
Our full datasets can be downloaded from here (extraction code: gxqt).
Modify the configuration file in llava/data/config.py
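In practice this edit amounts to pointing the data paths at your local copies of the downloaded datasets. The sketch below is hypothetical: the variable names are illustrative, so check llava/data/config.py for the actual ones:

```python
# llava/data/config.py -- hypothetical sketch; the variable names here are
# illustrative and may differ from the actual config. Point each path at the
# location where you placed the downloaded data.
PRETRAIN_DATA_PATH = "/path/to/pretrain/annotations.json"   # pretraining annotations
PRETRAIN_IMAGE_FOLDER = "/path/to/pretrain/images"          # pretraining images
FINETUNE_DATA_PATH = "/path/to/finetune/annotations.json"   # finetuning annotations
FINETUNE_IMAGE_FOLDER = "/path/to/finetune/images"          # finetuning images
```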
Pretraining
For 8x RTX 3090 (24 GB), run bash scripts/pretrain_deep.sh
For 8x A100 (40 GB), run bash scripts/pretrain_20.sh
Finetuning
For 8x RTX 3090 (24 GB), run bash scripts/finetune_deep.sh
For 8x A100 (40 GB), run bash scripts/finetune_20.sh
Modify the model-path parameter to switch between different models, then run:
bash scripts/cli2_a16.sh
We use the widely adopted MultimodalOCR benchmark to evaluate our method. As noted in the paper, we append "Support your reasoning with the coordinates [xmin, ymin, xmax, ymax]" to the end of each question.
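Appending the grounding instruction can be done once over the benchmark questions before running inference. A minimal sketch follows; the input file name and the "question" key are assumptions about the evaluation JSON layout, not guaranteed by MultimodalOCR:

```python
# Append the grounding instruction to every benchmark question before inference.
# Minimal sketch: the file names and the "question" key are assumptions about
# the evaluation JSON layout, not guaranteed by MultimodalOCR.
import json

SUFFIX = " Support your reasoning with the coordinates [xmin, ymin, xmax, ymax]."

with open("questions.json", "r", encoding="utf-8") as f:
    samples = json.load(f)

for sample in samples:
    sample["question"] = sample["question"].rstrip() + SUFFIX

with open("questions_grounded.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```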
Qualitative results on the validation set:
[Figure: examples1]
[Figure: examples2]
@article{wang2023towards,
title={Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs},
author={Wang, Yonghui and Zhou, Wengang and Feng, Hao and Zhou, Keyi and Li, Houqiang},
journal={arXiv preprint arXiv:2311.13194},
year={2023}
}
The code is heavily borrowed from LLaVA. Thanks for their great work.
Note: We have simplified the code in LLaVA to make it easier to read.