The official code for our paper "Visual Semantic Description Generation with MLLMs for Image-Text Matching" (VSD), accepted at ICME 2025. We referred to the implementations of GPO, HREM, and MiniCPM-V to build our code, and we are grateful for these outstanding works.
Image-text matching (ITM) aims to address the fundamental challenge of aligning visual and textual modalities, which inherently differ in their representations: continuous, high-dimensional image features vs. discrete, structured text. We propose a novel framework that bridges the modality gap by leveraging multimodal large language models (MLLMs) as visual semantic parsers. By generating rich Visual Semantic Descriptions (VSD), MLLMs provide semantic anchors that facilitate cross-modal alignment. Our approach combines: (1) instance-level alignment, which fuses visual features with VSD to enhance the linguistic expressiveness of image representations, and (2) prototype-level alignment, which clusters VSD to ensure category-level consistency. These modules can be seamlessly integrated into existing ITM models. Extensive experiments on Flickr30K and MSCOCO demonstrate substantial performance improvements. The approach also exhibits remarkable zero-shot generalization to cross-domain tasks, including news and remote sensing ITM.
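As a rough illustration of the two modules above (not the paper's exact formulation), instance-level alignment can be sketched as combining an image embedding with its VSD text embedding, and prototype-level alignment as averaging clustered VSD embeddings into prototypes. The function names, the convex-combination fusion rule, and `alpha` are illustrative assumptions:

```python
import numpy as np

def fuse_instance(img_emb, vsd_emb, alpha=0.5):
    """Instance-level alignment (sketch): convex combination of the image
    embedding and its VSD text embedding, followed by L2 normalization.
    The fusion rule and `alpha` are assumptions for illustration only."""
    fused = alpha * img_emb + (1.0 - alpha) * vsd_emb
    return fused / np.linalg.norm(fused)

def vsd_prototypes(vsd_embs, labels):
    """Prototype-level alignment (sketch): average the VSD embeddings that
    share a cluster label to obtain one prototype per cluster. `labels`
    would come from a clustering step (e.g. k-means) over the VSDs."""
    protos = np.stack([vsd_embs[labels == k].mean(axis=0)
                       for k in np.unique(labels)])
    return protos / np.linalg.norm(protos, axis=1, keepdims=True)
```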
Note: We have open-sourced the complete implementation code for GPO+VSD⋆, GPO+VSD†, HREM+VSD⋆, and HREM+VSD†. For the CLIP+VSD⋆ and CLIP+VSD† versions, researchers can reproduce the results using the technical details in our published paper together with the already open-sourced code. As the structure and organization of that portion of the code require further optimization, we plan to release it after completing the refactoring.
We recommend the following dependencies.
- Python 3.9
- PyTorch 1.11
- transformers 4.36.0
- open-clip-torch 2.24.0
- numpy 1.23.5
- tensorboard-logger 0.1.0
- The specific required environment can be found here
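A minimal environment setup, assuming pip and that the pinned versions above are available for your platform (pick the PyTorch build matching your CUDA version):

```shell
# Versions pinned to the dependency list above; adjust the torch build
# (CPU/CUDA) to match your system.
pip install torch==1.11.0 transformers==4.36.0 open-clip-torch==2.24.0 \
    numpy==1.23.5 tensorboard-logger==0.1.0
```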
All datasets used in the experiments and the necessary external components are organized as follows:
data
├── coco
│ ├── precomp # pre-computed BUTD region features for COCO, provided by SCAN
│ │ ├── train_ids.txt
│ │ ├── train_caps.txt
│ │ ├── train_aux_cap_bge_cpm_full.npy
│ │ ├── train_aux_cap_bge_flor_det.npy
│ │ ├── train_cap_bge.npy
│ │ ├── testall_aux_cap_bge_cpm_full_1.npy
│ │ ├── testall_aux_cap_bge_flor_det.npy
│ │ ├── testall_cap_bge.npy
│ │ ├── ......
│ │
│ ├── id_mapping.json
│ ├── images # (optional) raw coco images for OpenCLIP
│ ├── train2014
│ └── val2014
│
├── f30k
│ ├── precomp # pre-computed BUTD region features for Flickr30K, provided by SCAN
│ │ ├── train_ids.txt
│ │ ├── train_caps.txt
│ │ ├── train_aux_cap_bge_cpm_full.npy
│ │ ├── train_aux_cap_bge_flor_det.npy
│ │ ├── train_cap_bge.npy
│ │ ├── test_aux_cap_bge_cpm_full_1.npy
│ │ ├── test_aux_cap_bge_flor_det.npy
│ │ ├── ......
│ │
│ ├── id_mapping.json
│ ├── images # (optional) raw flickr30k images for OpenCLIP
│ ├── xxx.jpg
│ └── ...
│
└── vocab # vocab files provided by SCAN (only used when the text backbone is BiGRU)
VSD
├── bert-base-uncased # the pretrained checkpoint files for BERT-base
│ ├── config.json
│ ├── tokenizer_config.json
│ ├── vocab.txt
│ ├── pytorch_model.bin
│ ├── ......
├── CLIP # (optional) the pretrained checkpoint files for OpenCLIP
│ ├── config.json
│ ├── tokenizer_config.json
│ ├── vocab.json
│ ├── open_clip_config.json
│ ├── open_clip_pytorch_model.bin
│ ├── ......
│
└── ....
- Visual semantic description (VSD) preprocessed features: Baidu Yun (code: EVDP)
- BUTD features: SCAN (Kaggle) or Baidu Yun (code: AAHR)
- MSCOCO images: Official or PaddlePaddle
- Flickr30K images: Official or SCAN (Kaggle)
- Pretrained models: BERT-base-uncased, MiniCPM-V 2.6, Florence-2-large-ft, bge-large-en-v1.5, and OpenCLIP from Hugging Face
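The `.npy` files listed in the directory tree above hold precomputed caption/VSD embeddings. A quick sanity check after downloading them might look like this; the assumed 2-D `(num_items, embed_dim)` layout is an assumption to verify against the released files:

```python
import numpy as np

def load_embeddings(path):
    """Load one of the precomputed .npy embedding files from the directory
    tree above. We assume a 2-D array of shape (num_items, embed_dim);
    verify this against the released files."""
    arr = np.load(path)
    assert arr.ndim == 2, "expected shape (num_items, embed_dim)"
    return arr

# Usage (after downloading the features):
# caps = load_embeddings("data/f30k/precomp/train_cap_bge.npy")
```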
Train on Flickr30K and MSCOCO from scratch:
bash run_f30k.sh
bash run_coco.sh
Modify the corresponding parameters in eval.py to test on the Flickr30K or MSCOCO dataset:
python eval.py --dataset f30k --data_path "path/to/dataset"
python eval.py --dataset coco --data_path "path/to/dataset"
If you find our paper and code useful in your research, please consider giving a star ⭐ and a citation 📝:
@inproceedings{chen2025VSD,
title={Visual Semantic Description Generation with MLLMs for Image-Text Matching},
author={Chen, Junyu and Gao, Yihua and Li, Mingyong},
booktitle={2025 IEEE International Conference on Multimedia and Expo (ICME)},
pages={1--6},
year={2025},
organization={IEEE}
}


