
Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding

NeurIPS 2022, Spotlight Presentation, [arXiv] [BibTeX]

Introduction

We propose STCAT, a new one-stage spatio-temporal video grounding method that achieves state-of-the-art performance on the VidSTG and HC-STVG benchmarks. This repository provides the PyTorch implementation for model training and evaluation. For more details, please refer to our paper.


Dataset Preparation

The datasets are placed in the data folder with the following structure.

data
|_ vidstg
|  |_ videos
|  |  |_ [video name 0].mp4
|  |  |_ [video name 1].mp4
|  |  |_ ...
|  |_ vstg_annos
|  |  |_ train.json
|  |  |_ ...
|  |_ sent_annos
|  |  |_ train_annotations.json
|  |  |_ ...
|  |_ data_cache
|  |  |_ ...
|_ hc-stvg
|  |_ v1_video
|  |  |_ [video name 0].mp4
|  |  |_ [video name 1].mp4
|  |  |_ ...
|  |_ annos
|  |  |_ hcstvg_v1
|  |  |  |_ train.json
|  |  |  |_ test.json
|  |_ data_cache
|  |  |_ ...

You can prepare this structure with the following steps:

VidSTG

  • Download the videos for VidSTG from VidOR and put them into data/vidstg/videos. The original video download URL given by the VidOR dataset provider is broken; you can download the VidSTG videos from this link.
  • Download the text and temporal annotations from the VidSTG Repo and put them into data/vidstg/sent_annos.
  • Download the bounding-box annotations from here and put them into data/vidstg/vstg_annos.
  • For loading efficiency, we provide a dataset cache for VidSTG here. You can download it and put it into data/vidstg/data_cache.

HC-STVG

  • Download the version-1 HC-STVG videos and annotations from HC-STVG, then put them into data/hc-stvg/v1_video and data/hc-stvg/annos/hcstvg_v1.
  • For loading efficiency, we provide a dataset cache for HC-STVG here. You can download it and put it into data/hc-stvg/data_cache. A quick sanity check for the full layout is sketched below.
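
The following is a minimal sketch (ours, not part of this repository) that checks the expected directory structure for both datasets before training:

# check_layout.py -- hypothetical helper, not included in the repo
from pathlib import Path

EXPECTED = [
    "data/vidstg/videos",
    "data/vidstg/vstg_annos/train.json",
    "data/vidstg/sent_annos/train_annotations.json",
    "data/vidstg/data_cache",
    "data/hc-stvg/v1_video",
    "data/hc-stvg/annos/hcstvg_v1/train.json",
    "data/hc-stvg/annos/hcstvg_v1/test.json",
    "data/hc-stvg/data_cache",
]

for rel in EXPECTED:
    print("ok     " if Path(rel).exists() else "MISSING", rel)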

Setup

Requirements

The code is tested with PyTorch 1.10.0; other versions may be compatible as well. You can install the requirements with the following commands:

conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge
pip install -r requirements.txt

Then, download FFMPEG 4.1.9 and add it to the PATH environment variable so videos can be loaded.
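
To verify the environment before training, a quick sanity-check sketch (ours, not part of the repo) is:

# Hypothetical environment check: torch version, CUDA, and ffmpeg on PATH.
import shutil
import torch

print("PyTorch:", torch.__version__)            # tested with 1.10.0
print("CUDA available:", torch.cuda.is_available())
assert shutil.which("ffmpeg") is not None, "ffmpeg not found on PATH"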

Pretrained Checkpoints

Our model uses a ResNet-101 pretrained by MDETR as the vision backbone. Please download the pretrained weights from here and put them at data/pretrained/pretrained_resnet101_checkpoint.pth.
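
You can quickly inspect the downloaded checkpoint with a sketch like the following (the nesting of weights under a "model" key is how MDETR releases are usually packaged, but treat that as an assumption):

# Hypothetical inspection snippet, not part of the repo.
import torch

ckpt = torch.load("data/pretrained/pretrained_resnet101_checkpoint.pth",
                  map_location="cpu")
state = ckpt.get("model", ckpt)  # fall back to the top level if not nested
print(len(state), "tensors; first keys:", list(state)[:3])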

Usage

Note: use one video per GPU during training and evaluation; more than one video per GPU is untested and may cause bugs.

Training

For training on an 8-GPU node, you can use the following script:

# run for VidSTG
python3 -m torch.distributed.launch \
 --nproc_per_node=8 \
 scripts/train_net.py \
 --config-file "experiments/VidSTG/e2e_STCAT_R101_VidSTG.yaml" \
 --use-seed \
 OUTPUT_DIR data/vidstg/checkpoints/output \
 TENSORBOARD_DIR data/vidstg/checkpoints/output/tensorboard \
 INPUT.RESOLUTION 448

# run for HC-STVG
python3 -m torch.distributed.launch \
 --nproc_per_node=8 \
 scripts/train_net.py \
 --config-file "experiments/HC-STVG/e2e_STCAT_R101_HCSTVG.yaml" \
 --use-seed \
 OUTPUT_DIR data/hc-stvg/checkpoints/output \
 TENSORBOARD_DIR data/hc-stvg/checkpoints/output/tensorboard \
 INPUT.RESOLUTION 448

For more training options (such as other hyper-parameters), please modify the configuration files experiments/VidSTG/e2e_STCAT_R101_VidSTG.yaml and experiments/HC-STVG/e2e_STCAT_R101_HCSTVG.yaml.
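
The trailing KEY VALUE pairs in the commands above (OUTPUT_DIR, INPUT.RESOLUTION, ...) override fields from the YAML file on the command line. Assuming a yacs-style config, as used by MaskRCNN-Benchmark on which this repo builds, the mechanism looks roughly like this sketch (the field defaults here are illustrative):

# Illustrative only: how dotted KEY VALUE overrides are merged with yacs.
from yacs.config import CfgNode as CN

cfg = CN()
cfg.OUTPUT_DIR = ""
cfg.TENSORBOARD_DIR = ""
cfg.INPUT = CN()
cfg.INPUT.RESOLUTION = 416

cfg.merge_from_list([                       # the KEY VALUE pairs from the CLI
    "OUTPUT_DIR", "data/vidstg/checkpoints/output",
    "INPUT.RESOLUTION", "448",
])
cfg.freeze()
print(cfg.INPUT.RESOLUTION)                 # -> 448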

Evaluation

To evaluate the trained STCAT models, please run the following scripts:

# run for VidSTG
python3 -m torch.distributed.launch \
 --nproc_per_node=8 \
 scripts/test_net.py \
 --config-file "experiments/VidSTG/e2e_STCAT_R101_VidSTG.yaml" \
 --use-seed \
 MODEL.WEIGHT data/vidstg/checkpoints/stcat_res448/vidstg_res448.pth \
 OUTPUT_DIR data/vidstg/checkpoints/output \
 INPUT.RESOLUTION 448

# run for HC-STVG
python3 -m torch.distributed.launch \
 --nproc_per_node=8 \
 scripts/test_net.py \
 --config-file "experiments/HC-STVG/e2e_STCAT_R101_HCSTVG.yaml" \
 --use-seed \
 MODEL.WEIGHT data/hc-stvg/checkpoints/stcat_res448/hcstvg_res448.pth \
 OUTPUT_DIR data/hc-stvg/checkpoints/output \
 INPUT.RESOLUTION 448

Model Zoo

We provide our trained checkpoints with the ResNet-101 backbone so our results can be reproduced.

Dataset   Resolution   URL     Declarative (m_vIoU/vIoU@0.3/vIoU@0.5)   Interrogative (m_vIoU/vIoU@0.3/vIoU@0.5)   Size
VidSTG    416          Model   32.94/46.07/32.32                        27.87/38.89/26.07                          3.1GB
VidSTG    448          Model   33.14/46.20/32.58                        28.22/39.24/26.63                          3.1GB

Dataset   Resolution   URL     m_vIoU/vIoU@0.3/vIoU@0.5   Size
HC-STVG   416          Model   34.93/56.64/31.03          3.1GB
HC-STVG   448          Model   35.09/57.67/30.09          3.1GB
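
For reference, vIoU in the tables follows the standard VidSTG definition: per-frame box IoU summed over the frames where the predicted and ground-truth tubes overlap temporally, normalized by the union of their temporal extents. A minimal sketch (the helper names are ours, not the repo's):

# Sketch of the vIoU metric; m_vIoU averages viou() over the test set, and
# vIoU@R is the fraction of samples with viou() > R (R = 0.3, 0.5 above).
def box_iou(a, b):
    # a, b: boxes in (x1, y1, x2, y2) format
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def viou(pred, gt):
    # pred, gt: dicts mapping frame index -> box for the predicted / GT tube
    t_union = set(pred) | set(gt)
    t_inter = set(pred) & set(gt)
    return sum(box_iou(pred[t], gt[t]) for t in t_inter) / len(t_union)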

Acknowledgement

This repo is partly based on the open-source releases of MDETR, DAB-DETR, and MaskRCNN-Benchmark. The evaluation metric implementation is borrowed from TubeDETR for a fair comparison.

License

STCAT is released under the MIT license.

Citation

Consider giving this repository a star and citing it in your publications if it helps your research.

@article{jin2022embracing,
  title={Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding},
  author={Jin, Yang and Li, Yongzhi and Yuan, Zehuan and Mu, Yadong},
  journal={arXiv preprint arXiv:2209.13306},
  year={2022}
}
