iQua/M-DGT

The source code of the CVPR22 paper titled "Multi-Modal Dynamic Graph Transformer for Visual Grounding".

This is the source code of our paper for CVPR 2022. The core idea of the paper is a multi-modal dynamic graph transformer framework for progressive learning in visual grounding.


Summary

Introduction

The multi-modal dynamic graph transformer (M-DGT) framework is built on the novel idea of modeling visual grounding as a graph transformation. It relies on two crucial components: the multi-modal node transformer and the graph transformer. Treating each anchor in the image as a node, the multi-modal node transformer first constructs a graph based on the spatial positions of the anchors and then continuously produces 2D transformation coefficients that move these positions toward the ground-truth regions. During this process, the graph transformer optimizes the structure of the graph by removing nodes and unnecessary edges to keep the learning efficient. The whole framework can therefore be regarded as generating a series of dynamic graphs that gradually shrink to the target regions.

The performance of M-DGT is measured on two tasks: visual grounding and phrase grounding. The ReferItGame and RefCOCO datasets are used to measure the performance of M-DGT on visual grounding, while the Flickr30K Entities dataset is used to test M-DGT on phrase grounding.
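As a rough illustration of this graph-transformation view, the sketch below implements one refinement step in plain NumPy. It is not the paper's actual implementation: the box parameterization, the meaning of the coefficients, and the score-based pruning rule are all assumptions made for illustration.

import numpy as np

def refine_step(boxes, coeffs, scores, keep_thresh=0.3):
    """One illustrative M-DGT-style refinement step (assumed semantics, not the paper's code).

    boxes:  (N, 4) array of [x, y, w, h] anchor boxes, i.e. the graph nodes.
    coeffs: (N, 4) array of [dx, dy, dw, dh] produced by the node transformer.
    scores: (N,)   per-node confidences used by the graph transformer for pruning.
    """
    x, y, w, h = boxes.T
    dx, dy, dw, dh = coeffs.T
    # Shift the box centers and rescale their extents toward the ground-truth region.
    new_boxes = np.stack([x + dx * w, y + dy * h, w * np.exp(dw), h * np.exp(dh)], axis=1)
    # Graph transformer: drop low-confidence nodes so the graph gradually shrinks.
    keep = scores >= keep_thresh
    return new_boxes[keep], keep

rng = np.random.default_rng(0)
boxes = rng.uniform(0, 200, size=(8, 4))    # eight anchor nodes
coeffs = rng.normal(0, 0.1, size=(8, 4))    # predicted 2D transformation coefficients
scores = rng.uniform(0, 1, size=8)          # node confidences
boxes, keep = refine_step(boxes, coeffs, scores)
print(keep.sum(), "nodes kept after one refinement step")

In the full framework such steps are repeated, with the remaining nodes re-connected into a smaller graph before the next round of coefficients is predicted.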

Getting Started

Please install the necessary Python packages before running the code. The main packages are PyTorch, torchvision, mmcv, OpenCV, and albumentations.
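The exact installation commands depend on your Python and CUDA setup, but a typical pip-based install (using the package names above; OpenCV is distributed on PyPI as opencv-python) might look like:

cvpr@cvpr22:~$ pip install torch torchvision mmcv opencv-python albumentations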

Then, download the three corresponding datasets and set their paths in 'common_flags.py'.
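As a purely illustrative sketch, the dataset-path settings in 'common_flags.py' might look like the snippet below; the actual flag names are defined in that file and may differ:

# Illustrative only: these names are assumptions, not the real contents of common_flags.py.
REFERIT_GAME_ROOT = "/data/ReferItGame"               # ReferItGame images and annotations
FLICKR30K_ENTITIES_ROOT = "/data/Flickr30KEntities"   # Flickr30K Entities images and annotations
REFCOCO_ROOT = "/data/RefCOCO"                        # RefCOCO / RefCOCO+ / RefCOCOg annotations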

Datasets

This paper utilizes the following three classic datasets:

  1. ReferItGame dataset. Please check the data provider source code for this dataset in 'datasets/ReferItGame_provider'.
  2. Flickr30K Entities dataset. Please check the data provider source code for this dataset in 'datasets/F30KE_provider'.
  3. RefCOCO dataset. Please check the data provider source code for this dataset in 'datasets/ReferItGame_provider'.

Structure

Folder structure and functions for M-DGT

.
├── datasets                        # Three datasets used in the experiments
├── experiments                     # Directory used to save models and logging files
├── learning                        # Training, evaluation, and loss functions
├── models                          # Components and main structure of M-DGT
├── preprocess                      # Functions used to preprocess the original images
├── visualization                   # Visualization utilities for M-DGT
├── common_flags.py                 # Common paths and settings for the datasets
├── f30k_eval.py                    # Evaluate the model on the val/test sets of the Flickr30K Entities dataset
├── f30k_train.py                   # Train the model on the train set of the Flickr30K Entities dataset
├── refcoco_eval.py                 # Evaluate the model on the val/test sets of the ReferItGame/RefCOCO datasets
├── refcoco_train.py                # Train the model on the train sets of the ReferItGame/RefCOCO datasets
└── README.md

Operation

After setting up the environment and the three datasets, the model can be trained and evaluated by running:

  1. Operations on the Flickr30K Entities dataset

Train the model

cvpr@cvpr22:~$ python f30k_train.py

Test the model; set 'phase' to switch between the test and val splits.

cvpr@cvpr22:~$ python f30k_eval.py

  2. Operations on the RefCOCO/ReferItGame datasets

Train the model; set 'data_name' and 'split_type' to switch between the different datasets and splits (a sketched invocation is given after the commands below).

cvpr@cvpr22:~$ python refcoco_train.py

Test the model; set 'phase' to switch between the test and val splits.

cvpr@cvpr22:~$ python refcoco_eval.py
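If 'data_name', 'split_type', and 'phase' are exposed as command-line options rather than edited directly in 'common_flags.py' (this depends on how the flags are defined; the flag names and values below are assumptions, not the repository's documented interface), the runs might look like:

cvpr@cvpr22:~$ python refcoco_train.py --data_name refcoco --split_type unc
cvpr@cvpr22:~$ python refcoco_eval.py --data_name refcoco --split_type unc --phase test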

Performance

Flickr30K Entities

Method                                        Top-1 accuracy (%)
SOTA (Learning Cross-Modal Context Graph)     76.74
M-DGT                                         79.97

RefCOCO

Accuracy (%)   RefCOCO (Val / TestA / TestB)   RefCOCO+ (Val / TestA / TestB)   RefCOCOg (Val / Test)
SOTA           82 / 81.20 / 84.00              66.6 / 67.6 / 65.5               75.73 / 75.31
M-DGT          85.37 / 84.82 / 87.11           70.02 / 72.26 / 68.92            79.21 / 79.06

Acknowledgements

Our implementation refers to the source code of the following repositories and users:
