GroundingGPT: Language-Enhanced Multi-modal Grounding Model

Introduction

GroundingGPT is an end-to-end multimodal grounding model that accurately comprehends its inputs and has robust grounding capabilities across multiple modalities, including images, audio, and video. To address the issue of limited data, we construct a diverse, high-quality multimodal training dataset enriched with spatial and temporal information, which serves as a resource for further work in this field. Extensive experiments validate the effectiveness of GroundingGPT on understanding and grounding tasks across these modalities.

More details are available on our project page.

The overall structure of GroundingGPT. Blue boxes represent video as input, while yellow boxes represent image as input.

News

[2024.5] Our paper has been accepted to ACL 2024!
[2024.4] Our model is available now!
[2024.3] Our training dataset is available now!
[2024.3] Our code is available now!

Dependencies and Installation

    git clone https://github.com/lzw-lzw/GroundingGPT.git
    cd GroundingGPT
    conda create -n groundinggpt python=3.10 -y
    conda activate groundinggpt
    pip install -r requirements.txt 
    pip install flash-attn --no-build-isolation
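
After installation, a quick check can confirm that the active environment matches the commands above. The following is a minimal sketch, not part of the repository: the helper names `python_ok` and `missing_modules` are hypothetical, and it assumes only that the environment uses Python 3.10 and that the flash-attn package is imported as `flash_attn`.

```python
import importlib.util
import sys

# Assumption: the environment uses Python 3.10, as in the conda command above.
REQUIRED_PYTHON = (3, 10)


def python_ok(required=REQUIRED_PYTHON):
    """Return True if the running interpreter's major.minor equals `required`."""
    return tuple(sys.version_info[:2]) == tuple(required)


def missing_modules(names):
    """Return the top-level module names that cannot be located (no import is performed)."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # flash_attn is the import name of the flash-attn package installed above.
    print("Python version matches:", python_ok())
    print("Missing modules:", missing_modules(["flash_attn"]))
```

Run it inside the activated groundinggpt environment; an empty "Missing modules" list suggests the flash-attn install succeeded.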

Training

Training model preparation

Put the prepared checkpoints in the ./ckpt directory.
Prepare the ImageBind checkpoint: download imagebind_huge.pth from the link and put it under ./ckpt/imagebind.
Prepare the BLIP-2 checkpoint: download blip2_pretrained_flant5xxl.pth from the link and put it under ./ckpt.
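
Before starting training, it may help to verify that both checkpoint files are where the instructions above place them. The following is a minimal sketch under that assumption; `missing_checkpoints` is a hypothetical helper, not something the repository provides, and it checks only the two paths named above.

```python
from pathlib import Path

# Relative paths taken from the checkpoint instructions above.
EXPECTED_CHECKPOINTS = (
    "imagebind/imagebind_huge.pth",
    "blip2_pretrained_flant5xxl.pth",
)


def missing_checkpoints(ckpt_root="./ckpt", expected=EXPECTED_CHECKPOINTS):
    """Return the expected checkpoint paths that are not regular files."""
    root = Path(ckpt_root)
    return [str(root / rel) for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoints:")
        for path in missing:
            print("  " + path)
    else:
        print("All expected checkpoints found.")
```

Run it from the repository root so that the default ./ckpt path resolves correctly.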

Training dataset preparation

Put the prepared datasets in the dataset directory.
Prepare the LLaVA, COCO, GQA, OCR-VQA, TextVQA, and VisualGenome datasets: follow LLaVA.
Prepare the Flickr30K-Entities dataset: follow Flickr30K-Entities.
Prepare the Valley dataset: follow Valley.
Prepare the DiDeMo dataset: follow DiDeMo.
Prepare the ActivityNet Captions dataset: follow ActivityNet Captions.
Prepare the Charades-STA dataset: follow Charades-STA.
Prepare the VGGSS dataset: follow VGGSS.
Prepare the WavCaps dataset: follow WavCaps.
Prepare the Clotho dataset: follow Clotho.

Training

Inference

Download GroundingGPT-7B and change the model_path in GroundingGPT/lego/serve/cli.py.
Run inference with the script:
```
  python3 lego/serve/cli.py
```

Demo

Download GroundingGPT-7B and change the model_path on line 141 of GroundingGPT/lego/serve/gradio_web_server.py.

Use the script to launch a Gradio web demo:

    python3 lego/serve/gradio_web_server.py

Statement of Clarification

We hereby clarify that the Language Enhanced Multi-modal Grounding Model (formerly referred to as a LEGO Language Model), which has been renamed GroundingGPT, is in no way associated with or endorsed by the LEGO Group. There is no investment, collaboration, or any other form of relationship between the LEGO Group and our model, which previously used the LEGO name. We kindly request that any media or third-party entities that have published or disseminated inaccurate or misleading reports regarding this model promptly correct or remove the misinformation. Your immediate attention to this matter would be greatly appreciated. We deeply apologize to the LEGO Group for any confusion, inconvenience, or harm this has caused.

Acknowledgement

Citation

If you find GroundingGPT useful for your research and applications, please cite it using this BibTeX:

@article{li2024lego,
  title={LEGO: Language Enhanced Multi-modal Grounding Model},
  author={Li, Zhaowei and Xu, Qi and Zhang, Dong and Song, Hang and Cai, Yiqing and Qi, Qi and Zhou, Ran and Pan, Junting and Li, Zefeng and Vu, Van Tu and others},
  journal={arXiv preprint arXiv:2401.06071},
  year={2024}
}
