
Beyond Generic: Enhancing Image Captioning with Real-World Knowledge using Vision-Language Pre-Training Model

This repo provides the source code & data of our paper: Beyond Generic: Enhancing Image Captioning with Real-World Knowledge using Vision-Language Pre-Training Model (ACM MM 2023).

@misc{cheng2023generic,
      title={Beyond Generic: Enhancing Image Captioning with Real-World Knowledge using Vision-Language Pre-Training Model}, 
      author={Kanzhi Cheng and Wenpo Song and Zheng Ma and Wenhao Zhu and Zixuan Zhu and Jianbing Zhang},
      year={2023},
      eprint={2308.01126},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Code Structure


├── config.py	# config
├── data	# coco data & knowcap data
│   ├── data_cc12m_SelectForRreplay.json
│   ├── dataset_coco.json
│   ├── test.json
│   ├── train.json
│   ├── val.json
│   ├── knowcap_240.json
│   ├── knowcap_240_test.json
│   ├── knowcap_240_test_unseen.json
│   ├── knowcap_240_val.json
│   ├── train_mix_32000.json
│   └── ...
├── data_load.py	# dataloader
├── test.py	    # evaluation on coco
├── test_knowcap.py	    # evaluation on knowcap
├── models	# models (OFA,BLIP,GIT)
│   ├── OFA
│   ├── BLIP
│   └── GIT
├── train_multitask.py      # K-Replay training
└── utils	# supporting code & tools
    ├── beamsearch.py	# beam search
    ├── cc12m.py	# filter replay data from cc12m
    ├── convert_ofa.py	# checkpoint conversion
    ├── eval.py		# generate captions & calculate metrics
    ├── import_models.py
    ├── log.py
    ├── loss.py		# loss function of K-Replay
    ├── optimizer_tools.py
    └── prepro_data.py   # construct the data in ./data

KnowCap Dataset


KnowCap is a new dataset for the evaluation of knowledge-enhanced image captioning, containing 1424 images and 4156 reference descriptions carefully written by human annotators.

Download the images and annotations of KnowCap.
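As a quick sanity check after downloading, the annotation files can be inspected with a few lines of Python. A minimal sketch; the per-entry schema is not documented here, so print one entry to confirm the actual field names:

```python
# Minimal sketch: peek at the KnowCap test annotations.
import json

with open("./data/knowcap_240_test.json", "r") as f:
    knowcap = json.load(f)

# The top-level layout (list vs. dict) is an assumption; handle both.
sample = knowcap[0] if isinstance(knowcap, list) else next(iter(knowcap.values()))
print(type(knowcap).__name__, len(knowcap))
print(sample)
```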

Preparing Data & Models


Step 1:

Download the images of the COCO and KnowCap datasets.

Step 2:

Run prepro_data.py to collate and split the COCO and KnowCap datasets in ./data.

Alternatively, we provide the processed data, which can be placed into the ./data directory. Note that the file_path in each dataset needs to be modified according to the path of the images downloaded in Step 1. Similarly, some of the parameters in config.py need to be modified to match your own environment.
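For instance, a minimal sketch of rewriting the file_path field in each annotation file to point at your local image directories; the path prefixes, the file list, and the list-of-dicts layout are placeholders and assumptions to adapt to the released data:

```python
# Hedged sketch: point the file_path field in each annotation file at the
# locally downloaded images. OLD_PREFIX/NEW_PREFIX are placeholders, and the
# list-of-dicts layout is an assumption; adjust to the released files.
import json

OLD_PREFIX = "/path/in/released/data"   # placeholder
NEW_PREFIX = "/your/local/images"       # placeholder

for name in ["train.json", "val.json", "test.json", "knowcap_240_test.json"]:
    path = f"./data/{name}"
    with open(path, "r") as f:
        data = json.load(f)
    for item in data:
        item["file_path"] = item["file_path"].replace(OLD_PREFIX, NEW_PREFIX)
    with open(path, "w") as f:
        json.dump(data, f)
```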

Step 3:

Prepare the checkpoints of the VLP models (taking OFA as an example) for training and testing.

  1. Download the transformers version checkpoints of OFA.
  2. However, since there are some problems with the official transformers checkpoints, we manually replace their parameters with those from the official fairseq checkpoints using convert_ofa.py.

Alternatively, we provide the converted ckpts.
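Conceptually, the conversion overwrites the parameters of the transformers checkpoint with those of the fairseq checkpoint; a heavily simplified sketch of the idea (the real convert_ofa.py needs a mapping between the two naming schemes, and the paths below are placeholders):

```python
# Simplified sketch of the idea behind convert_ofa.py: replace transformers
# OFA parameters with the official fairseq parameters. The identity name
# mapping and the paths are illustrative assumptions.
import torch

hf_state = torch.load("ofa_transformers.bin", map_location="cpu")           # placeholder
fairseq_state = torch.load("ofa_fairseq.pt", map_location="cpu")["model"]   # placeholder

for hf_name in hf_state:
    fs_name = hf_name  # assumption: the real script maps transformers -> fairseq names
    if fs_name in fairseq_state and hf_state[hf_name].shape == fairseq_state[fs_name].shape:
        hf_state[hf_name] = fairseq_state[fs_name]

torch.save(hf_state, "ofa_converted.bin")
```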

Reproduce the main results


The baseline result of OFA on KnowCap: CUDA_VISIBLE_DEVICES=0 python test_knowcap.py --model OFA --ofa_ckpts xxx --length_penalty 1.0, where ofa_ckpts is obtained in Step 3.

The OFA+K-Replay result on KnowCap: CUDA_VISIBLE_DEVICES=0 python test_knowcap.py --model OFA --trained_ckpts xxx --length_penalty 1.0, where trained_ckpts can be downloaded here.

To evaluate on COCO, use test.py instead of test_knowcap.py.

Tips:

To eliminate the need for coco_id in the evaluation, we customized the COCOEval function in eval.py. Therefore, xxx/site-packages/pycocoevalcap/eval.py needs to be replaced with (or modified according to) our eval.py in order to use the current evaluation code.
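Conceptually, the customization just keys the pycocoevalcap scorers by arbitrary ids instead of coco_ids. A minimal sketch of that idea (not the exact code in utils/eval.py):

```python
# Minimal sketch: compute captioning metrics keyed by arbitrary ids, using the
# standard pycocoevalcap scorers directly. Illustrates the idea of the
# customized COCOEval, not its exact implementation.
from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

def evaluate(gts_raw, res_raw):
    """gts_raw/res_raw: {any_id: [caption, ...]}; ids need not be coco_ids."""
    tokenizer = PTBTokenizer()
    gts = tokenizer.tokenize({k: [{"caption": c} for c in v] for k, v in gts_raw.items()})
    res = tokenizer.tokenize({k: [{"caption": c} for c in v] for k, v in res_raw.items()})
    scores = {}
    bleu, _ = Bleu(4).compute_score(gts, res)
    scores.update({f"BLEU-{i + 1}": s for i, s in enumerate(bleu)})
    scores["CIDEr"], _ = Cider().compute_score(gts, res)
    return scores
```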

Training with K-Replay


Step 4:

Start training with K-Replay: CUDA_VISIBLE_DEVICES=0 python train_multitask.py --mode train --model OFA --id ofa_kreplay --batch_size 60 --learning_rate 7e-6 --label_smoothing 0.1 --multitask_weight 1.0 --KD_temperature 16.0 --knowdistill_weight 1.0 --save_model_freq 100 --ofa_ckpts /home/chengkz/checkpoints/ofa/OFA-large-caption-trainedenc --ofa_ckpts_distill /home/chengkz/checkpoints/ofa/OFA-large-caption-XEfinetuned --train_mix ./data/train_mix_32000.json --method XEdistill

The ofa_ckpts and ofa_ckpts_distill are obtained in Step 3; train_mix_32000.json is obtained in Step 2.
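For orientation, the flags above correspond to a multitask objective that combines caption supervision with knowledge distillation from the frozen XE-finetuned teacher at temperature KD_temperature. A schematic sketch under those assumptions (see utils/loss.py for the actual loss):

```python
# Schematic sketch of the multitask objective implied by the training flags:
# XE with label smoothing plus temperature-scaled KD from a frozen teacher.
# Shapes and names are illustrative; the repo's loss lives in utils/loss.py.
import torch.nn.functional as F

def kreplay_loss(student_logits, teacher_logits, targets,
                 multitask_weight=1.0, knowdistill_weight=1.0, T=16.0):
    # caption cross-entropy: logits [B, L, V] against targets [B, L]
    xe = F.cross_entropy(student_logits.flatten(0, 1), targets.flatten(),
                         label_smoothing=0.1)
    # distillation: match the teacher's softened distribution (scaled by T^2)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)
    return multitask_weight * xe + knowdistill_weight * kd
```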

Step 5:

Evaluation on COCO: CUDA_VISIBLE_DEVICES=0 python test.py --model OFA --id ofa_kreplay --step 300 --length_penalty 1.0.

Evaluation on KnowCap: CUDA_VISIBLE_DEVICES=0 python test_knowcap.py --model OFA --id ofa_kreplay --step 300 --length_penalty 1.0.

Tips:

OFA uses a ResNet as the backbone of its visual encoder. In our experiments, we found that the batch-norm layers in the ResNet backbone do not give good estimates of the mean and std due to the small batch size we used, which leads to degraded model performance. Therefore, we fixed the mean and std of these layers during training by setting momentum=0.0 in ./transformers/models/ofa/resnet.py.
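A minimal sketch of the same fix applied programmatically, assuming a torchvision-style ResNet module:

```python
# Minimal sketch: freeze BatchNorm running statistics, equivalent in effect to
# setting momentum=0.0 in ./transformers/models/ofa/resnet.py. With
# momentum=0.0, running_mean/var are no longer updated from batch statistics.
import torch.nn as nn

def freeze_bn_running_stats(model: nn.Module) -> None:
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d)):
            module.momentum = 0.0
```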
