
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs

[Teaser figure]

Contents:

  1. Getting Started
  2. Demo
  3. Benchmark
  4. Evaluation
  5. Training
  6. License
  7. Citation
  8. Acknowledgement

Getting Started

Installation

conda create -n vstar python=3.10 -y
conda activate vstar
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
export PYTHONPATH=$PYTHONPATH:path_to_vstar_repo
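To verify that the environment is set up correctly, a quick sanity check such as the following can be run (a minimal sketch; it assumes a CUDA-capable GPU is available):

# sanity_check.py -- illustrative environment check, not part of the repo
import torch
import flash_attn

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("flash-attn version:", flash_attn.__version__)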

Pre-trained Model

The VQA LLM can be downloaded here.
The visual search model can be downloaded here.
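Alternatively, the weights can be fetched programmatically with huggingface_hub (a sketch only; the repository IDs below are placeholders and should be replaced with the actual model repositories linked above):

from huggingface_hub import snapshot_download

# Placeholder repo IDs -- replace with the actual model repositories linked above.
vqa_llm_path = snapshot_download(repo_id="ORG/vqa_llm", local_dir="checkpoints/vqa_llm")
search_model_path = snapshot_download(repo_id="ORG/visual_search_model", local_dir="checkpoints/visual_search_model")

print("VQA LLM downloaded to:", vqa_llm_path)
print("Visual search model downloaded to:", search_model_path)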

Training Dataset

The alignment stage of the VQA LLM uses the 558K subset of the LAION-CC-SBU dataset used by LLaVA, which can be downloaded here.

The instruction tuning stage requires several instruction tuning subsets which can be found here.

The instruction tuning data requires images from COCO-2014, COCO-2017, and GQA. After downloading them, organize the data following the structure below:

├── coco2014
│   └── train2014
├── coco2017
│   └── train2017
└── gqa
     └── images
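A small helper for creating this layout is sketched below (the data root path is an assumption; adjust it to wherever the images were downloaded):

from pathlib import Path

# Hypothetical data root -- adjust to your setup.
data_root = Path("data")
for subdir in ["coco2014/train2014", "coco2017/train2017", "gqa/images"]:
    (data_root / subdir).mkdir(parents=True, exist_ok=True)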

Demo

After installation, you can launch a local Gradio demo by running python app.py. The pre-trained model weights will be downloaded automatically if they are not already present.

You should see a web page like the one below:

[Demo screenshot]

Benchmark

Our V*Bench is available here. The benchmark contains folders for different subtasks. Each folder contains image files and their corresponding annotation JSON files, paired by filename. The format of the annotation files is:

{
  "target_object": [],  // a list of target object names
  "bbox": [],           // a list of target object bounding boxes in <x, y, w, h> format
  "question": "",
  "options": []         // a list of options; the first option is the correct one by default
}
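For illustration, an annotation file can be paired with its image and parsed as sketched below (the subtask folder name, sample name, and image extension are hypothetical):

import json
from pathlib import Path

# Hypothetical subtask folder and sample name -- adjust to the downloaded benchmark.
subtask_dir = Path("vstar_bench/some_subtask")
sample_name = "sample_0"

annotation = json.loads((subtask_dir / f"{sample_name}.json").read_text())
image_path = subtask_dir / f"{sample_name}.jpg"  # the image shares the annotation's filename

for obj_name, (x, y, w, h) in zip(annotation["target_object"], annotation["bbox"]):
    print(f"{obj_name}: top-left ({x}, {y}), size {w}x{h}")

print("Question:", annotation["question"])
print("Correct option (first by default):", annotation["options"][0])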

Evaluation

To evaluate our model on the V*Bench benchmark, run

python vstar_bench_eval.py --benchmark-folder PATH_TO_BENCHMARK_FOLDER

To evaluate our visual search mechanism on the annotated targets from the V*Bench benchmark, run

python visual_search.py --benchmark-folder PATH_TO_BENCHMARK_FOLDER
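Because the first option in each annotation is the correct one by default, multiple-choice accuracy can be computed from model predictions as in the sketch below (the prediction format here is hypothetical, not the output of the scripts above):

# Hypothetical prediction format: a list of (chosen_option, options) pairs.
predictions = [
    ("red", ["red", "blue"]),     # model chose the first (correct) option
    ("blue", ["green", "blue"]),  # model chose a wrong option
]

correct = sum(1 for chosen, options in predictions if chosen == options[0])
accuracy = correct / len(predictions)
print(f"Accuracy: {accuracy:.2%}")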

The detailed evaluation results of our model can be found here.

Training

The training of the VQA LLM consists of two stages.

For the pre-training stage, enter the LLaVA folder and run

sh pretrain.sh

For the instruction tuning stage, enter the LLaVA folder and run

sh finetune.sh

For the training data preparation and training procedures of our visual search model, please check this doc.

License

This project is under the MIT license. See LICENSE for details.

Citation

Please consider citing our paper if you find this project helpful for your research:

@article{vstar,
  title={V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs},
  author={Penghao Wu and Saining Xie},
  journal={arXiv preprint arXiv:2312.14135},
  year={2023}
}

Acknowledgement

  • This work is built upon LLaVA and LISA.
