PyTorch implementation of the ACM MM 2024 paper "VrdONE: One-stage Video Visual Relation Detection".
Video Visual Relation Detection (VidVRD) focuses on understanding how entities interact over time and space in videos, a key step for gaining deeper insights into video scenes beyond basic visual tasks. Traditional methods for VidVRD, challenged by its complexity, typically split the task into two parts: one for identifying the relation categories present and another for determining their temporal boundaries. This split overlooks the inherent connection between the two. To recognize entity pairs' spatiotemporal interactions across a range of durations, we propose VrdONE, a streamlined yet effective one-stage model. VrdONE combines the features of subjects and objects, turning predicate detection into 1D instance segmentation on their combined representations. This setup identifies relation categories and generates binary masks in one go, eliminating the need for extra steps like proposal generation or post-processing. VrdONE facilitates the interaction of features across frames, adeptly capturing both short-lived and enduring relations. Additionally, we introduce the Subject-Object Synergy (SOS) module, which enhances how subjects and objects perceive each other before being combined. VrdONE achieves state-of-the-art performance on the VidOR and ImageNet-VidVRD benchmarks, showcasing its superior capability in discerning relations across different temporal scales.
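To make the one-stage formulation concrete, here is a minimal PyTorch sketch of the idea: fuse per-frame subject and object features, then predict relation-category logits together with per-category binary temporal masks in a single forward pass. This is an illustration only, not the paper's actual architecture; every layer, dimension, and name below is an assumption.

```python
import torch
import torch.nn as nn

class OneStagePredicateHead(nn.Module):
    """Illustrative sketch of one-stage predicate detection as 1D instance
    segmentation (all design choices here are assumptions, not VrdONE's)."""

    def __init__(self, feat_dim=256, num_predicates=50):
        super().__init__()
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)    # subject-object fusion
        self.temporal = nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1)
        self.cls_head = nn.Linear(feat_dim, num_predicates)   # relation categories
        self.mask_head = nn.Conv1d(feat_dim, num_predicates, kernel_size=1)

    def forward(self, subj_feats, obj_feats):
        # subj_feats, obj_feats: (batch, T, feat_dim) per-frame entity features
        x = self.fuse(torch.cat([subj_feats, obj_feats], dim=-1))  # (B, T, C)
        x = self.temporal(x.transpose(1, 2))                       # (B, C, T)
        cls_logits = self.cls_head(x.mean(dim=-1))                 # (B, P)
        masks = self.mask_head(x).sigmoid()                        # (B, P, T)
        # Categories and binary temporal masks come out of one forward pass,
        # with no proposal generation or post-processing stage.
        return cls_logits, masks

head = OneStagePredicateHead()
logits, masks = head(torch.randn(2, 64, 256), torch.randn(2, 64, 256))
print(logits.shape, masks.shape)  # torch.Size([2, 50]) torch.Size([2, 50, 64])
```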
- Installation
- Prepare the VidOR dataset
- Prepare the ImageNet-VidVRD dataset
- Train VrdONE on VidOR
- Train VrdONE-X on VidOR
- Train VrdONE on ImageNet-VidVRD
- Evaluate VrdONE on VidOR
- Evaluate VrdONE on ImageNet-VidVRD
- This repository needs `python=3.10.14`, `pytorch=1.12.1`, and `torchvision=0.13.1`.
- Run the following command to install the required packages:
```
pip install -r requirements.txt
```
- Clone Shang's evaluation helper https://github.com/xdshang/VidVRD-helper to the root path.
- Install ffmpeg, e.g. with `sudo apt-get install ffmpeg`.
- Organize the datasets as follows:
```
├── datasets
│   ├── vidor
│   │   ├── annotations
│   │   │   ├── training
│   │   │   │   ├── 0000
│   │   │   │   ├── ...
│   │   │   │   └── 1203
│   │   │   └── validation
│   │   │       ├── 0001
│   │   │       ├── ...
│   │   │       └── 1203
│   │   ├── features
│   │   │   ├── GT_boxfeatures_training
│   │   │   ├── MEGA_VidORval_cache
│   │   │   │   └── MEGAv9_m60s0.3_freq1_VidORval_freq1_th_15-180-200-0.40.pkl
│   │   │   └── vidor_per_video_val
│   │   ├── frames
│   │   └── videos
│   │       ├── 0000
│   │       ├── ...
│   │       └── 1203
│   └── gt_json_eval
```
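Before running the preparation steps below, here is a quick, unofficial sanity check (not part of the repo) that the expected tree is in place:

```python
from pathlib import Path

# Verify the directories from the tree above exist before running any scripts.
root = Path("datasets/vidor")
for sub in ["annotations/training", "annotations/validation", "features",
            "frames", "videos"]:
    status = "ok" if (root / sub).is_dir() else "MISSING"
    print(f"{root / sub}: {status}")
```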
- Download the VidOR dataset and unzip all videos (training and validation) into `datasets/vidor/videos`. Unzip the training and validation annotations into `datasets/vidor/annotations`.
- Go to the `datasets` directory and run the following command to decode the videos into frames (an illustrative sketch of this step appears after this list):
```
python vidor_video_to_frames.py
```
- Extract visual features from the ground-truth bounding boxes. We follow Gao's method from https://github.com/Dawn-LX/VidVRD-tracklets. First, download the pretrained MEGA weights and put them into `datasets/mega/ckpts`. Step into `datasets/mega` and run the following command to extract features (a sanity-check sketch appears after this list):
```
bash scripts/extract_vidor_gt.sh [gpu_id]
```
- Download the extracted proposal features of the validation set from Gao's method (BIG). Then, put them into `datasets/vidor/features/MEGA_VidORval_cache`. We copy the `dataloader` part from BIG. Step into `datasets/VidSGG-BIG` and divide the proposal features into per-video files with the following command (a conceptual sketch follows this list):
```
python prepare_vidor_proposal.py
```
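As referenced in the frame-decoding step above, here is an illustrative sketch of what decoding videos into frames can look like. It is not the repo's `vidor_video_to_frames.py`, and the output layout and JPEG naming below are assumptions.

```python
import subprocess
from pathlib import Path

# Illustrative sketch only; the repo's vidor_video_to_frames.py is the
# authoritative implementation. Output layout and naming are assumptions.
videos = Path("datasets/vidor/videos")
frames = Path("datasets/vidor/frames")

for mp4 in videos.rglob("*.mp4"):
    out_dir = frames / mp4.parent.name / mp4.stem
    out_dir.mkdir(parents=True, exist_ok=True)
    # Decode every frame to a zero-padded JPEG sequence with ffmpeg.
    subprocess.run(
        ["ffmpeg", "-i", str(mp4), "-q:v", "2", str(out_dir / "%06d.jpg")],
        check=True,
    )
```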
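After the feature-extraction step, a quick way to confirm it produced output is to peek at one file. The path pattern below is an assumption, and the script prints only generic information because the file format is not documented here.

```python
import glob
import pickle

# Sanity check after extraction. The path pattern is an assumption; adjust it
# to wherever extract_vidor_gt.sh actually writes its output.
paths = glob.glob(
    "datasets/vidor/features/GT_boxfeatures_training/**/*.pkl", recursive=True
)
if not paths:
    print("no feature files found -- check the extraction step")
else:
    with open(paths[0], "rb") as f:
        feats = pickle.load(f)
    print(type(feats))  # inspect the container type without assuming a schema
    if isinstance(feats, dict):
        print(list(feats)[:5])
```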
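Conceptually, the per-video split in the last step looks like the sketch below. `prepare_vidor_proposal.py` is the authoritative version; this sketch assumes the cache pickle is a dict keyed by video id, which may not match the real file's structure.

```python
import pickle
from pathlib import Path

# Hedged sketch of the per-video split; prepare_vidor_proposal.py is the
# authoritative version. ASSUMES the cache pickle is a dict keyed by video id.
cache = Path(
    "datasets/vidor/features/MEGA_VidORval_cache/"
    "MEGAv9_m60s0.3_freq1_VidORval_freq1_th_15-180-200-0.40.pkl"
)
out_dir = Path("datasets/vidor/features/vidor_per_video_val")
out_dir.mkdir(parents=True, exist_ok=True)

with open(cache, "rb") as f:
    proposals = pickle.load(f)

# Write one pickle per video so the dataloader can fetch them individually.
for video_id, per_video in proposals.items():
    with open(out_dir / f"{video_id}.pkl", "wb") as f:
        pickle.dump(per_video, f)
```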
Coming soon...
Coming soon...
- VidOR: download the VrdONE checkpoint and run the following command:
```
python eval_vidor.py \
    --cfg_path configs/vidor.yaml \
    --exp_dir experiments/vrdone_vidor \
    --ckpt_path ckpts/ckpt_vidor.pth \
    --topk 1
```
or just run the script:
```
bash scripts/eval_vidor.sh [gpu_id]
```
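For reference, the VidVRD-helper evaluates per-video lists of predicted relation instances. The snippet below sketches one instance with dummy values; the field set is our best understanding, so check the helper's README at https://github.com/xdshang/VidVRD-helper for the exact specification.

```python
# One predicted relation instance in (approximately) the format consumed by
# the VidVRD-helper; all values are dummies.
instance = {
    "triplet": ["person", "ride", "bicycle"],  # subject, predicate, object
    "score": 0.87,                             # confidence for this instance
    "duration": [10, 55],                      # start/end frame of the relation
    "sub_traj": [[9.0, 10.0, 50.0, 100.0]],    # per-frame [x1, y1, x2, y2] boxes
    "obj_traj": [[11.0, 20.0, 60.0, 120.0]],
}
predictions = {"results": {"<video_id>": [instance]}}
```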
| Model | Dataset | Extra Features | Download Path |
| --- | --- | --- | --- |
| VrdONE | VidOR | - | Hugging Face |
| VrdONE-X | VidOR | CLIP | |
| VrdONE | ImageNet-VidVRD | - | |
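Since the table points to Hugging Face, one way to fetch a checkpoint programmatically is via `huggingface_hub`. The repo id and filename below are placeholders, not the project's actual identifiers; substitute the real links from the table.

```python
from huggingface_hub import hf_hub_download

# Placeholder repo id and filename -- replace with the actual table entries.
ckpt_path = hf_hub_download(repo_id="<user>/<vrdone-weights>",
                            filename="ckpt_vidor.pth")
print(ckpt_path)  # local path to the downloaded checkpoint
```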
```
@inproceedings{jiang2024vrdone,
  author = {Jiang, Xinjie and Zheng, Chenxi and Xu, Xuemiao and Liu, Bangzhen and Zheng, Weiying and Zhang, Huaidong and He, Shengfeng},
  title = {VrdONE: One-stage Video Visual Relation Detection},
  booktitle = {Proceedings of the 32nd ACM International Conference on Multimedia},
  year = {2024},
}
```
This project is mainly based on ActionFormer, MaskFormer, and BIG. Thanks for their amazing projects!