In this repo, we provide our pretrained model. It uses an input resolution of 224×224. If you want the high-resolution model (after all four training stages), please refer to our Hugging Face page: https://huggingface.co/Infi-MM/infimm-hd.
Download the model from https://huggingface.co/lllliuhhhhggg/infimm_pretrain/tree/main. We provide the two pretraining models (stage 1 only) described in our paper: https://arxiv.org/abs/2403.01487. It is a Flamingo-style model; the only difference is that we remove the perceiver resampler. We use EVA ViT-E as the vision encoder and Vicuna as the language model. These models are pretrained on MMC4, OBELICS, COYO-238M (sampled from COYO-700M), LAION-115M, and LAION-COCO. With the same amount of data, our model trains much faster than LLaVA, thanks to cross-attention information fusion. Feel free to build something from our pretrained models.
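To illustrate the Flamingo-style fusion mentioned above, here is a minimal single-head sketch of gated cross-attention in NumPy: text hidden states query the visual features, and a tanh gate (initialized at zero in Flamingo so training starts from the frozen LLM's behavior) scales the visual contribution. All names and shapes here are illustrative, not the actual layer from this repo.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(text_h, vis_h, Wq, Wk, Wv, gate):
    """Single-head sketch: text tokens (T, d) attend to visual features (V, d)."""
    q = text_h @ Wq                                   # queries from text, (T, d)
    k = vis_h @ Wk                                    # keys from vision, (V, d)
    v = vis_h @ Wv                                    # values from vision, (V, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (T, V) attention weights
    # tanh gate: at gate=0 the layer is an identity on the text stream
    return text_h + np.tanh(gate) * (attn @ v)

# Illustrative usage with random weights
rng = np.random.default_rng(0)
d = 8
text_h = rng.standard_normal((4, d))   # 4 text tokens
vis_h = rng.standard_normal((6, d))    # 6 visual tokens
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
fused = gated_cross_attention(text_h, vis_h, Wq, Wk, Wv, gate=1.0)
print(fused.shape)  # (4, 8): fusion keeps the text sequence length
```

Because fusion happens in side cross-attention layers, visual tokens do not lengthen the LLM's input sequence, which is one intuition for the training-speed advantage over LLaVA-style token concatenation.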
Because we use our company's internal training framework, we cannot release the training code directly. Instead, demo_forward.py demonstrates the data processing and a forward pass.
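For orientation before reading demo_forward.py, here is a hedged sketch of the input shapes a Flamingo-style forward pass typically expects at the 224 pretraining resolution. The function and argument names below are assumptions for illustration only; the authoritative interface is demo_forward.py in this repo.

```python
import numpy as np

IMG_SIZE = 224  # pretraining resolution stated in this README

def make_dummy_batch(batch_size=2, num_images=1, seq_len=32, vocab_size=32000):
    """Build dummy inputs with typical Flamingo-style shapes (illustrative only)."""
    # images: one or more per sample, channels-first, normalized floats
    pixel_values = np.random.rand(
        batch_size, num_images, 3, IMG_SIZE, IMG_SIZE
    ).astype(np.float32)
    # token ids for the interleaved text (image placeholder tokens omitted here)
    input_ids = np.random.randint(0, vocab_size, size=(batch_size, seq_len))
    attention_mask = np.ones((batch_size, seq_len), dtype=np.int64)
    return pixel_values, input_ids, attention_mask

pixel_values, input_ids, attention_mask = make_dummy_batch()
print(pixel_values.shape)  # (2, 1, 3, 224, 224)
```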
The copyright of the images belongs to the original authors.
See LICENSE for more information.
Acknowledgements: https://github.com/baaivision/EVA and https://github.com/mlfoundations/open_flamingo
If you find our work useful, please cite:

@misc{liu2024infimmhdleapforwardhighresolution,
title={InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding},
author={Haogeng Liu and Quanzeng You and Xiaotian Han and Yiqi Wang and Bohan Zhai and Yongfei Liu and Yunzhe Tao and Huaibo Huang and Ran He and Hongxia Yang},
year={2024},
eprint={2403.01487},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2403.01487},
}