# BLIP

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

## Abstract

Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner.

## How to use it?

### Use the model

```python
from mmpretrain import inference_model

result = inference_model('blip-base_3rdparty_caption', 'demo/cat-dog.png')
print(result)
# {'pred_caption': 'a puppy and a cat sitting on a blanket'}
```
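
The other checkpoints listed under "Models and results" can be driven through the same `inference_model` helper. Below is a minimal sketch for visual question answering; it assumes the VQA inferencer accepts an image path followed by a question string, and both the image path and the question here are placeholders:

```python
from mmpretrain import inference_model

# Sketch only: assumes the VQA inferencer takes (image, question) positionally.
result = inference_model(
    'blip-base_3rdparty_vqa',            # VQA checkpoint from the tables below
    'demo/cat-dog.png',                  # any local image path (placeholder)
    'What animals are in the picture?',  # free-form question (placeholder)
)
print(result)  # expected to contain a predicted answer string
```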

### Test Command

Prepare your dataset according to the docs.

Test:

```shell
python tools/test.py configs/blip/blip-base_8xb32_caption.py https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-caption_20230419-a5b71af3.pth
```
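
For multi-GPU evaluation, OpenMMLab projects typically ship a `tools/dist_test.sh` launcher; the sketch below assumes that wrapper is present in your checkout, and the GPU count of 8 is only illustrative:

```shell
# Sketch only: assumes the standard OpenMMLab dist_test.sh wrapper is available.
bash tools/dist_test.sh \
    configs/blip/blip-base_8xb32_caption.py \
    https://download.openmmlab.com/mmclassification/v1/blip/blip-base_3rdparty_coco-caption_20230419-a5b71af3.pth \
    8  # number of GPUs (illustrative)
```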

## Models and results

### Image Caption on COCO

| Model                          | Params (M) | BLEU-4 | CIDEr  | Config | Download |
| :----------------------------- | :--------: | :----: | :----: | :----: | :------: |
| `blip-base_3rdparty_caption`\* |   223.97   | 40.12  | 132.82 | config |  model   |

### Image Caption on NoCaps

| Model                          | Params (M) | SPICE | CIDEr  | Config | Download |
| :----------------------------- | :--------: | :---: | :----: | :----: | :------: |
| `blip-base_3rdparty_caption`\* |   223.97   | 14.69 | 109.12 | config |  model   |

### Image Caption on Flickr30k

| Model                          | Params (M) | SPICE | CIDEr | Config | Download |
| :----------------------------- | :--------: | :---: | :---: | :----: | :------: |
| `blip-base_3rdparty_caption`\* |   223.97   | 15.58 | 68.89 | config |  model   |

### Visual Grounding on RefCOCO

| Model                     | Params (M) | Accuracy (testA) | Accuracy (testB) | Config |   Download   |
| :------------------------ | :--------: | :--------------: | :--------------: | :----: | :----------: |
| `blip-base_8xb16_refcoco` |   498.49   |      86.14       |      77.33       | config | model \| log |

### Visual Question Answering on VQAv2

| Model                      | Params (M) | Accuracy | Config | Download |
| :------------------------- | :--------: | :------: | :----: | :------: |
| `blip-base_3rdparty_vqa`\* |   361.48   |  78.20   | config |  model   |

### Visual Question Answering on OK-VQA

| Model                      | Params (M) | Accuracy | Config | Download |
| :------------------------- | :--------: | :------: | :----: | :------: |
| `blip-base_3rdparty_vqa`\* |   361.48   |  40.59#  | config |  model   |

### Visual Question Answering on OCR-VQA

| Model                      | Params (M) | Accuracy | Config | Download |
| :------------------------- | :--------: | :------: | :----: | :------: |
| `blip-base_3rdparty_vqa`\* |   361.48   |  28.30#  | config |  model   |

### Image-To-Text Retrieval on COCO

| Model                            | Params (M) | Recall@1 | Recall@5 | Config | Download |
| :------------------------------- | :--------: | :------: | :------: | :----: | :------: |
| `blip-base_3rdparty_retrieval`\* |   447.49   |  82.52   |  95.34   | config |  model   |

### Text-To-Image Retrieval on COCO

| Model                            | Params (M) | Recall@1 | Recall@5 | Config | Download |
| :------------------------------- | :--------: | :------: | :------: | :----: | :------: |
| `blip-base_3rdparty_retrieval`\* |   447.49   |  64.82   |  86.28   | config |  model   |

### Image-To-Text Retrieval on Flickr30k

| Model                            | Params (M) | Recall@1 | Recall@5 | Config | Download |
| :------------------------------- | :--------: | :------: | :------: | :----: | :------: |
| `blip-base_3rdparty_retrieval`\* |   447.49   |  95.10#  |  99.60#  | config |  model   |

### Text-To-Image Retrieval on Flickr30k

| Model                            | Params (M) | Recall@1 | Recall@5 | Config | Download |
| :------------------------------- | :--------: | :------: | :------: | :----: | :------: |
| `blip-base_3rdparty_retrieval`\* |   447.49   |  85.26#  |  96.58#  | config |  model   |

### NLVR on NLVR2

| Model                       | Params (M) | Top-1 (%) | Config | Download |
| :-------------------------- | :--------: | :-------: | :----: | :------: |
| `blip-base_3rdparty_nlvr`\* |   259.37   |   82.33   | config |  model   |

Models with \* are converted from the official repo. The config files of these models are only for inference; we haven't reproduced the training results.

Results with # denote zero-shot evaluation; the corresponding model has not been fine-tuned on that dataset.
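
To work with any of the checkpoints above programmatically rather than through the inferencers, the models can be built with `mmpretrain.get_model`. A minimal sketch, assuming `pretrained=True` resolves and downloads the weights registered for the given model name:

```python
import torch
from mmpretrain import get_model

# Build the retrieval variant listed above and load its pretrained weights.
# Assumption: `pretrained=True` fetches the checkpoint registered for this name.
model = get_model('blip-base_3rdparty_retrieval', pretrained=True)
model.eval()

# Rough parameter count, for comparison with the "Params (M)" column.
n_params = sum(p.numel() for p in model.parameters()) / 1e6
print(f'{n_params:.2f}M parameters')
```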

## Citation

```bibtex
@inproceedings{li2022blip,
  title={BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation},
  author={Junnan Li and Dongxu Li and Caiming Xiong and Steven Hoi},
  year={2022},
  booktitle={ICML},
}
```