The official code for our paper "Visual Semantic Description Generation with MLLMs for Image-Text Matching" (VSD), accepted at ICME 2025. We referred to the implementations of GPO, HREM, and MiniCPM-V to build our code, and we are grateful for these outstanding works.
Image-text matching (ITM) aims to address the fundamental challenge of aligning visual and textual modalities, which inherently differ in their representations: continuous, high-dimensional image features vs. discrete, structured text. We propose a novel framework that bridges the modality gap by leveraging multimodal large language models (MLLMs) as visual semantic parsers. By generating rich Visual Semantic Descriptions (VSD), MLLMs provide semantic anchors that facilitate cross-modal alignment. Our approach combines: (1) instance-level alignment, which fuses visual features with VSD to enhance the linguistic expressiveness of image representations, and (2) prototype-level alignment, which clusters VSD to ensure category-level consistency. These modules can be seamlessly integrated into existing ITM models. Extensive experiments on Flickr30K and MSCOCO demonstrate substantial performance improvements. The approach also exhibits remarkable zero-shot generalization to cross-domain tasks, including news and remote sensing ITM.
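As a rough illustration of the two modules above (not the paper's exact formulation), instance-level alignment can be sketched as combining an image embedding with its VSD text embedding, and prototype-level alignment as averaging clustered VSD embeddings into prototypes. The function names, the convex-combination fusion rule, and `alpha` are illustrative assumptions:

```python
import numpy as np

def fuse_instance(img_emb, vsd_emb, alpha=0.5):
    """Instance-level alignment (sketch): convex combination of the image
    embedding and its VSD text embedding, followed by L2 normalization.
    The fusion rule and `alpha` are assumptions for illustration only."""
    fused = alpha * img_emb + (1.0 - alpha) * vsd_emb
    return fused / np.linalg.norm(fused)

def vsd_prototypes(vsd_embs, labels):
    """Prototype-level alignment (sketch): average the VSD embeddings that
    share a cluster label to obtain one prototype per cluster. `labels`
    would come from a clustering step (e.g. k-means) over the VSDs."""
    protos = np.stack([vsd_embs[labels == k].mean(axis=0)
                       for k in np.unique(labels)])
    return protos / np.linalg.norm(protos, axis=1, keepdims=True)
```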
Note: We have open-sourced the complete implementation code for GPO+VSD⋆, GPO+VSD†, HREM+VSD⋆, and HREM+VSD†. For the CLIP+VSD⋆ and CLIP+VSD† versions, researchers can reproduce the results using the technical details in our published paper together with the already open-sourced code. As the structure and organization of that portion of the code require further optimization, we plan to release it after completing the refactoring.
We recommend the following dependencies.
- Python 3.9
- PyTorch 1.11
- transformers 4.36.0
- open-clip-torch 2.24.0
- numpy 1.23.5
- tensorboard-logger 0.1.0
- The specific required environment can be found here
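A minimal environment setup, assuming pip and that the pinned versions above are available for your platform (pick the PyTorch build matching your CUDA version):

```shell
# Versions pinned to the dependency list above; adjust the torch build
# (CPU/CUDA) to match your system.
pip install torch==1.11.0 transformers==4.36.0 open-clip-torch==2.24.0 \
    numpy==1.23.5 tensorboard-logger==0.1.0
```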
All datasets used in the experiments and the necessary external components are organized as follows:
data
├── coco
│ ├── precomp # pre-computed BUTD region features for COCO, provided by SCAN
│ │ ├── train_ids.txt
│ │ ├── train_caps.txt
│ │ ├── train_aux_cap_bge_cpm_full.npy
│ │ ├── train_aux_cap_bge_flor_det.npy
│ │ ├── train_cap_bge.npy
│ │ ├── testall_aux_cap_bge_cpm_full_1.npy
│ │ ├── testall_aux_cap_bge_flor_det.npy
│ │ ├── testall_cap_bge.npy
│ │ ├── ......
│ │
│ ├── id_mapping.json
│ ├── images # (optional) raw coco images for OpenCLIP
│ ├── train2014
│ └── val2014
│
├── f30k
│ ├── precomp # pre-computed BUTD region features for Flickr30K, provided by SCAN
│ │ ├── train_ids.txt
│ │ ├── train_caps.txt
│ │ ├── train_aux_cap_bge_cpm_full.npy
│ │ ├── train_aux_cap_bge_flor_det.npy
│ │ ├── train_cap_bge.npy
│ │ ├── test_aux_cap_bge_cpm_full_1.npy
│ │ ├── test_aux_cap_bge_flor_det.npy
│ │ ├── ......
│ │
│ ├── id_mapping.json
│ ├── images # (optional) raw flickr30k images for OpenCLIP
│ ├── xxx.jpg
│ └── ...
│
└── vocab # vocab files provided by SCAN (only used when the text backbone is BiGRU)
VSD
├── bert-base-uncased # the pretrained checkpoint files for BERT-base
│ ├── config.json
│ ├── tokenizer_config.json
│ ├── vocab.txt
│ ├── pytorch_model.bin
│ ├── ......
├── CLIP # (optional) the pretrained checkpoint files for OpenCLIP
│ ├── config.json
│ ├── tokenizer_config.json
│ ├── vocab.json
│ ├── open_clip_config.json
│ ├── open_clip_pytorch_model.bin
│ ├── ......
│
└── ....
- Visual semantic description (VSD) preprocessed features: Baidu Yun (code: EVDP)
- BUTD features: SCAN (Kaggle) or Baidu Yun (code: AAHR)
- MSCOCO images: Official or PaddlePaddle
- Flickr30K images: Official or SCAN (Kaggle)
- Pretrained models: BERT-base-uncased, MiniCPM-V 2.6, Florence-2-large-ft, bge-large-en-v1.5, and OpenCLIP from Hugging Face
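The `.npy` files listed in the directory tree above hold precomputed caption/VSD embeddings. A quick sanity check after downloading them might look like this; the assumed 2-D `(num_items, embed_dim)` layout is an assumption to verify against the released files:

```python
import numpy as np

def load_embeddings(path):
    """Load one of the precomputed .npy embedding files from the directory
    tree above. We assume a 2-D array of shape (num_items, embed_dim);
    verify this against the released files."""
    arr = np.load(path)
    assert arr.ndim == 2, "expected shape (num_items, embed_dim)"
    return arr

# Usage (after downloading the features):
# caps = load_embeddings("data/f30k/precomp/train_cap_bge.npy")
```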
Train on Flickr30K and MSCOCO from scratch:
bash run_f30k.sh
bash run_coco.sh
Modify the corresponding parameters in eval.py to test on the Flickr30K or MSCOCO dataset:
python eval.py --dataset f30k --data_path "path/to/dataset"
python eval.py --dataset coco --data_path "path/to/dataset"
If you find our paper and code useful in your research, please consider giving a star ⭐ and a citation 📝:
@inproceedings{chen2025VSD,
title={Visual Semantic Description Generation with MLLMs for Image-Text Matching},
author={Chen, Junyu and Gao, Yihua and Li, Mingyong},
booktitle={2025 IEEE International Conference on Multimedia and Expo (ICME)},
pages={1--6},
year={2025},
organization={IEEE}
}


