PyTorch implementation of the ACM MM 2024 paper "VrdONE: One-stage Video Visual Relation Detection".
Video Visual Relation Detection (VidVRD) focuses on understanding how entities interact over time and space in videos, a key step for gaining deeper insights into video scenes beyond basic visual tasks. Traditional methods for VidVRD, challenged by its complexity, typically split the task into two parts: one for identifying the relation categories present and another for determining their temporal boundaries. This split overlooks the inherent connection between the two. To recognize entity pairs' spatiotemporal interactions across a range of durations, we propose VrdONE, a streamlined yet effective one-stage model. VrdONE combines the features of subjects and objects, turning predicate detection into 1D instance segmentation on their combined representations. This setup identifies relation categories and generates binary masks in one go, eliminating the need for extra steps like proposal generation or post-processing. VrdONE facilitates the interaction of features across frames, adeptly capturing both short-lived and enduring relations. Additionally, we introduce the Subject-Object Synergy (SOS) module, which enhances how subjects and objects perceive each other before being combined. VrdONE achieves state-of-the-art performance on the VidOR and ImageNet-VidVRD benchmarks, showcasing its superior capability in discerning relations across different temporal scales.
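To make the one-stage formulation concrete, here is a minimal PyTorch sketch of the idea: fuse per-frame subject and object features, then predict relation-category logits together with per-category binary temporal masks in a single forward pass. This is an illustration only, not the paper's actual architecture; every layer, dimension, and name below is an assumption.

```python
import torch
import torch.nn as nn

class OneStagePredicateHead(nn.Module):
    """Illustrative sketch of one-stage predicate detection as 1D instance
    segmentation (all design choices here are assumptions, not VrdONE's)."""

    def __init__(self, feat_dim=256, num_predicates=50):
        super().__init__()
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)    # subject-object fusion
        self.temporal = nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1)
        self.cls_head = nn.Linear(feat_dim, num_predicates)   # relation categories
        self.mask_head = nn.Conv1d(feat_dim, num_predicates, kernel_size=1)

    def forward(self, subj_feats, obj_feats):
        # subj_feats, obj_feats: (batch, T, feat_dim) per-frame entity features
        x = self.fuse(torch.cat([subj_feats, obj_feats], dim=-1))  # (B, T, C)
        x = self.temporal(x.transpose(1, 2))                       # (B, C, T)
        cls_logits = self.cls_head(x.mean(dim=-1))                 # (B, P)
        masks = self.mask_head(x).sigmoid()                        # (B, P, T)
        # Categories and binary temporal masks come out of one forward pass,
        # with no proposal generation or post-processing stage.
        return cls_logits, masks

head = OneStagePredicateHead()
logits, masks = head(torch.randn(2, 64, 256), torch.randn(2, 64, 256))
print(logits.shape, masks.shape)  # torch.Size([2, 50]) torch.Size([2, 50, 64])
```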
- Installation
- Prepare the VidOR dataset
- Prepare the ImageNet-VidVRD dataset
- Train VrdONE on VidOR
- Train VrdONE-X on VidOR
- Train VrdONE on ImageNet-VidVRD
- Evaluate VrdONE on VidOR
- Evaluate VrdONE on ImageNet-VidVRD
- This repository needs `python=3.10.14`, `pytorch=1.12.1`, and `torchvision=0.13.1`.
- Run the following command to install the required packages:
```
pip install -r requirements.txt
```
- Clone Shang's evaluation helper https://github.com/xdshang/VidVRD-helper to the root path.
- Install ffmpeg, e.g. with `sudo apt-get install ffmpeg`.
- Organize the datasets as follows:
```
├── datasets
│   ├── vidor
│   │   ├── annotations
│   │   │   ├── training
│   │   │   │   ├── 0000
│   │   │   │   ├── ...
│   │   │   │   └── 1203
│   │   │   └── validation
│   │   │       ├── 0001
│   │   │       ├── ...
│   │   │       └── 1203
│   │   ├── features
│   │   │   ├── GT_boxfeatures_training
│   │   │   ├── MEGA_VidORval_cache
│   │   │   │   └── MEGAv9_m60s0.3_freq1_VidORval_freq1_th_15-180-200-0.40.pkl
│   │   │   └── vidor_per_video_val
│   │   ├── frames
│   │   └── videos
│   │       ├── 0000
│   │       ├── ...
│   │       └── 1203
│   └── gt_json_eval
```
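Before running the preparation steps below, here is a quick, unofficial sanity check (not part of the repo) that the expected tree is in place:

```python
from pathlib import Path

# Verify the directories from the tree above exist before running any scripts.
root = Path("datasets/vidor")
for sub in ["annotations/training", "annotations/validation", "features",
            "frames", "videos"]:
    status = "ok" if (root / sub).is_dir() else "MISSING"
    print(f"{root / sub}: {status}")
```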
- Download the VidOR dataset and unzip all videos (training and validation) into `datasets/vidor/videos`. Unzip the training and validation annotations into `datasets/vidor/annotations`.
- Go to the `datasets` directory and run the following command to decode the videos into frames (an illustrative sketch of this step appears after this list):
```
python vidor_video_to_frames.py
```
- Extract visual features from the ground-truth bounding boxes. We follow Gao's method from https://github.com/Dawn-LX/VidVRD-tracklets. First, download the pretrained MEGA weights and put them into `datasets/mega/ckpts`. Step into `datasets/mega` and run the following command to extract features (a sanity-check sketch appears after this list):
```
bash scripts/extract_vidor_gt.sh [gpu_id]
```
- Download the extracted proposal features of the validation set from Gao's method (BIG). Then, put them into `datasets/vidor/features/MEGA_VidORval_cache`. We copy the `dataloader` part from BIG. Step into `datasets/VidSGG-BIG` and divide the proposal features into per-video files with the following command (a conceptual sketch follows this list):
```
python prepare_vidor_proposal.py
```
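As referenced in the frame-decoding step above, here is an illustrative sketch of what decoding videos into frames can look like. It is not the repo's `vidor_video_to_frames.py`, and the output layout and JPEG naming below are assumptions.

```python
import subprocess
from pathlib import Path

# Illustrative sketch only; the repo's vidor_video_to_frames.py is the
# authoritative implementation. Output layout and naming are assumptions.
videos = Path("datasets/vidor/videos")
frames = Path("datasets/vidor/frames")

for mp4 in videos.rglob("*.mp4"):
    out_dir = frames / mp4.parent.name / mp4.stem
    out_dir.mkdir(parents=True, exist_ok=True)
    # Decode every frame to a zero-padded JPEG sequence with ffmpeg.
    subprocess.run(
        ["ffmpeg", "-i", str(mp4), "-q:v", "2", str(out_dir / "%06d.jpg")],
        check=True,
    )
```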
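After the feature-extraction step, a quick way to confirm it produced output is to peek at one file. The path pattern below is an assumption, and the script prints only generic information because the file format is not documented here.

```python
import glob
import pickle

# Sanity check after extraction. The path pattern is an assumption; adjust it
# to wherever extract_vidor_gt.sh actually writes its output.
paths = glob.glob(
    "datasets/vidor/features/GT_boxfeatures_training/**/*.pkl", recursive=True
)
if not paths:
    print("no feature files found -- check the extraction step")
else:
    with open(paths[0], "rb") as f:
        feats = pickle.load(f)
    print(type(feats))  # inspect the container type without assuming a schema
    if isinstance(feats, dict):
        print(list(feats)[:5])
```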
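Conceptually, the per-video split in the last step looks like the sketch below. `prepare_vidor_proposal.py` is the authoritative version; this sketch assumes the cache pickle is a dict keyed by video id, which may not match the real file's structure.

```python
import pickle
from pathlib import Path

# Hedged sketch of the per-video split; prepare_vidor_proposal.py is the
# authoritative version. ASSUMES the cache pickle is a dict keyed by video id.
cache = Path(
    "datasets/vidor/features/MEGA_VidORval_cache/"
    "MEGAv9_m60s0.3_freq1_VidORval_freq1_th_15-180-200-0.40.pkl"
)
out_dir = Path("datasets/vidor/features/vidor_per_video_val")
out_dir.mkdir(parents=True, exist_ok=True)

with open(cache, "rb") as f:
    proposals = pickle.load(f)

# Write one pickle per video so the dataloader can fetch them individually.
for video_id, per_video in proposals.items():
    with open(out_dir / f"{video_id}.pkl", "wb") as f:
        pickle.dump(per_video, f)
```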
Coming soon...
Coming soon...
- VidOR: download the VrdONE checkpoint and run the following command:
```
python eval_vidor.py \
    --cfg_path configs/vidor.yaml \
    --exp_dir experiments/vrdone_vidor \
    --ckpt_path ckpts/ckpt_vidor.pth \
    --topk 1
```
or just run the script:
```
bash scripts/eval_vidor.sh [gpu_id]
```
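For reference, the VidVRD-helper evaluates per-video lists of predicted relation instances. The snippet below sketches one instance with dummy values; the field set is our best understanding, so check the helper's README at https://github.com/xdshang/VidVRD-helper for the exact specification.

```python
# One predicted relation instance in (approximately) the format consumed by
# the VidVRD-helper; all values are dummies.
instance = {
    "triplet": ["person", "ride", "bicycle"],  # subject, predicate, object
    "score": 0.87,                             # confidence for this instance
    "duration": [10, 55],                      # start/end frame of the relation
    "sub_traj": [[9.0, 10.0, 50.0, 100.0]],    # per-frame [x1, y1, x2, y2] boxes
    "obj_traj": [[11.0, 20.0, 60.0, 120.0]],
}
predictions = {"results": {"<video_id>": [instance]}}
```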
| Model | Dataset | Extra Features | Download Path |
| --- | --- | --- | --- |
| VrdONE | VidOR | - | Hugging Face |
| VrdONE-X | VidOR | CLIP | |
| VrdONE | ImageNet-VidVRD | - | |
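Since the table points to Hugging Face, one way to fetch a checkpoint programmatically is via `huggingface_hub`. The repo id and filename below are placeholders, not the project's actual identifiers; substitute the real links from the table.

```python
from huggingface_hub import hf_hub_download

# Placeholder repo id and filename -- replace with the actual table entries.
ckpt_path = hf_hub_download(repo_id="<user>/<vrdone-weights>",
                            filename="ckpt_vidor.pth")
print(ckpt_path)  # local path to the downloaded checkpoint
```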
```
@inproceedings{jiang2024vrdone,
  author = {Jiang, Xinjie and Zheng, Chenxi and Xu, Xuemiao and Liu, Bangzhen and Zheng, Weiying and Zhang, Huaidong and He, Shengfeng},
  title = {VrdONE: One-stage Video Visual Relation Detection},
  booktitle = {Proceedings of the 32nd ACM International Conference on Multimedia},
  year = {2024},
}
```
This project is mainly based on ActionFormer, MaskFormer, and BIG. Thanks for their amazing projects!