# VideoMAE

> VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

## Abstract

Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets. In this paper, we show that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). We are inspired by the recent ImageMAE and propose customized video tube masking with an extremely high ratio. This simple design makes video reconstruction a more challenging self-supervision task, thus encouraging extracting more effective video representations during this pre-training process. We obtain three important findings on SSVP: (1) An extremely high proportion of masking ratio (i.e., 90% to 95%) still yields favorable performance of VideoMAE. The temporally redundant video content enables a higher masking ratio than that of images. (2) VideoMAE achieves impressive results on very small datasets (i.e., around 3k-4k videos) without using any extra data. (3) VideoMAE shows that data quality is more important than data quantity for SSVP. Domain shift between pre-training and target datasets is an important issue. Notably, our VideoMAE with the vanilla ViT can achieve 87.4% on Kinetics-400, 75.4% on Something-Something V2, 91.3% on UCF101, and 62.6% on HMDB51, without using any extra data.
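The tube masking mentioned in the abstract can be pictured with a short sketch: one spatial mask is sampled and repeated along the temporal axis, so the same patches are hidden in every frame. This is only an illustrative approximation, not the exact implementation used by the paper or this repo.

```python
import numpy as np

def tube_mask(num_frames, num_patches_per_frame, mask_ratio=0.9, seed=None):
    """Sample one spatial mask and repeat it across frames ("tube" masking).

    Returns a boolean array of shape (num_frames, num_patches_per_frame);
    True marks a patch that is hidden and must be reconstructed.
    """
    rng = np.random.default_rng(seed)
    num_masked = int(round(mask_ratio * num_patches_per_frame))
    spatial_mask = np.zeros(num_patches_per_frame, dtype=bool)
    # The same spatial patches are hidden in every frame, so the model cannot
    # simply copy a visible patch from a neighbouring frame.
    spatial_mask[rng.choice(num_patches_per_frame, num_masked, replace=False)] = True
    return np.tile(spatial_mask, (num_frames, 1))

# e.g. 8 temporal token slots, a 14x14 patch grid, 90% masking
mask = tube_mask(num_frames=8, num_patches_per_frame=14 * 14, mask_ratio=0.9)
print(mask.shape, mask.mean())  # (8, 196), roughly 0.9
```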

## Results and Models

### Kinetics-400

| frame sampling strategy | resolution | backbone | top1 acc | top5 acc | reference top1 acc | reference top5 acc | testing protocol | FLOPs | params | config | ckpt |
| :---------------------: | :------------: | :------: | :------: | :------: | :----------------: | :----------------: | :---------------: | :---: | :----: | :----: | :--: |
| 16x4x1 | short-side 320 | ViT-B | 81.3 | 95.0 | 81.5 [VideoMAE] | 95.1 [VideoMAE] | 5 clips x 3 crops | 180G | 87M | config | ckpt [1] |
| 16x4x1 | short-side 320 | ViT-L | 85.3 | 96.7 | 85.2 [VideoMAE] | 96.8 [VideoMAE] | 5 clips x 3 crops | 597G | 305M | config | ckpt [1] |

[1] The models are ported from the repo VideoMAE and tested on our data. Currently, we only support testing of VideoMAE models; training will be available soon.

  1. The values in the columns named "reference" are the results reported by the original repo.
  2. The validation set of Kinetics400 we used consists of 19796 videos. These videos are available at Kinetics400-Validation. The corresponding data list (each line is of the format 'video_id, num_frames, label_index') and the label map are also available; a minimal parsing sketch follows these notes.
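Assuming a data list in the comma-separated format quoted above, reading it could look like the sketch below. The file path in the usage comment is hypothetical; substitute the location of your downloaded data list.

```python
def load_data_list(path):
    """Parse lines of the form 'video_id, num_frames, label_index'."""
    samples = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            video_id, num_frames, label_index = [x.strip() for x in line.split(',')]
            samples.append((video_id, int(num_frames), int(label_index)))
    return samples

# e.g. samples = load_data_list('data/kinetics400/kinetics400_val_list.txt')
```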

For more details on data preparation, you can refer to preparing_kinetics.

## Test

You can use the following command to test a model.

```shell
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
```

Example: test the ViT-base model on the Kinetics-400 dataset and dump the result to a pkl file.

```shell
python tools/test.py configs/recognition/videomae/vit-base-p16_videomae-k400-pre_16x4x1_kinetics-400.py \
    checkpoints/SOME_CHECKPOINT.pth --dump result.pkl
```
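If you pass `--dump result.pkl`, the predictions are written to that pickle file. A minimal sketch for loading and inspecting it is shown below; the exact fields of each entry depend on your MMAction2 version, so the sketch only prints what is stored.

```python
import pickle

# Load the dumped test results and look at one entry.
with open('result.pkl', 'rb') as f:
    results = pickle.load(f)

print(f'number of test samples: {len(results)}')
print('first entry:', results[0])
```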

For more details, you can refer to the Test part in the Training and Test Tutorial.

## Citation

```BibTeX
@inproceedings{tong2022videomae,
  title={Video{MAE}: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
  author={Zhan Tong and Yibing Song and Jue Wang and Limin Wang},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022}
}
```