[Feature] Support VideoMAE #1942
Merged
130 commits
[Refactor] update slowfast (#1880)
Dai-Wenxun b5f248a
[Fix] fix readme (#1882)
Dai-Wenxun 0e7e57c
[Refactor] rename posec3d related config files (#1883)
cir7 7eb01b8
[Refactor] rename 2s-agcn related config files (#1879)
cir7 a3aefb9
[Refactor] update timesformer (#1884)
Dai-Wenxun dd032a2
[Refactor] update x3d configs (#1892)
Dai-Wenxun cf6e570
fix typo (#1895)
Dai-Wenxun 023579f
update trn (#1888)
Dai-Wenxun 9aa8171
[Fix] Fix configs and README for SlowOnly (#1889)
hukkai 3bf2890
[Fix] Fix README for TSM (#1887)
hukkai ce70e23
[Fix] fix config download links (#1890)
cir7 361d331
[Fix] fix audio readme (#1898)
Dai-Wenxun af5495b
[Refactor] rename stgcn related config files (#1891)
cir7 9580d95
fix posec3d readme and metafile (#1899)
Dai-Wenxun 6c7cbe5
[Fix] fix typo in TIN config and slurm_train (#1904)
cir7 b003f93
[Fix] Fix configs for detection (#1903)
hukkai de658ce
[Fix] Fix configs for TSN (#1905)
hukkai a15b934
[CI] Fix CI for dev-1.x (#1923)
cir7 2485566
[Demo] Support Skeleton demo (#1920)
Dai-Wenxun 9f833de
[Fix] Fix urls in trn and i3d (#1925)
Dai-Wenxun 8da0d01
[Refactor] update 2sagcn readme (#1915)
Dai-Wenxun 439bfce
[Refactor] update tanet readme (#1916)
Dai-Wenxun a7bafb5
first commit
Dai-Wenxun 319eaf0
fix lint
Dai-Wenxun 9f2ed25
fix constructor
Dai-Wenxun 6441419
fix lint
Dai-Wenxun ee906df
remove timm dependencies
Dai-Wenxun d8103fb
fix lint
Dai-Wenxun 7e39095
replace type hint: Tensor->torch.Tensor
Dai-Wenxun 32a9a3d
fix lint
Dai-Wenxun 2cc2245
support video mae
hukkai 0a5872f
support video mae
hukkai 19f288c
support video mae
hukkai 3ac55f0
support video mae
hukkai 5081578
[Fix] fix a bug in UT (#1937)
hukkai 2a106cf
support video mae
hukkai 5ab247e
support video mae
hukkai a55f465
support video mae
hukkai 7aa069e
support video mae
hukkai ca4502f
support video mae
hukkai 40dcbea
support video mae
hukkai da02f31
support video mae
hukkai 91f5f29
support video mae
hukkai c91cd4b
add auto_scale_lr and file_client_args
Dai-Wenxun 59386c9
add swin-l k700
Dai-Wenxun 25d2491
fix lint
Dai-Wenxun b0e2b73
fix bug
Dai-Wenxun 586f400
fix lint
Dai-Wenxun 49d08d2
support video mae
hukkai f6dc205
support video mae
hukkai afe7aa9
[Fix] fix a bug in UT (#1937) (#2)
hukkai 31bc669
support video mae
hukkai 0b036e5
support video mae
hukkai 415d654
support video mae
hukkai 901be04
support video mae
hukkai 0237f47
support video mae
hukkai eae1bbb
Update vit_mae-pretrained-vit-base_16x4x1_kinetics-400.py
hukkai 6f8718a
[Refactor] update TPN readme (#1927)
cir7 b292feb
[Refactor] remove onnx related tools (#1928)
cir7 24e0eab
[Doc] update migration doc (#1931)
cir7 d83303e
[Doc] fix link in data_prepare.md (#1944)
cir7 23ce7af
support video mae
hukkai bafcabe
update
hukkai 1adff3c
[Fix] fix BSN and BMN configs for localization (#1913)
hukkai e6045ed
modify stgcn-ntu60 (#1914)
Dai-Wenxun 88e6946
[Refactor] update TIN readme (#1926)
cir7 f4ac064
fix review
hukkai 0a5342f
fix review
hukkai 5cc687e
optimize backbone
Dai-Wenxun 7451bd3
optimize backbone
Dai-Wenxun 7b469cc
modify configs
Dai-Wenxun fd990b1
fix urls
Dai-Wenxun dddb2e3
fix lint
Dai-Wenxun 6c7f131
fix bug
Dai-Wenxun fc8ceea
fix bugs in cls_head
Dai-Wenxun 7cd5f0c
Dev 1.x (#4)
hukkai 3c63b72
rebase
hukkai 69f31b7
Merge branch 'dev-1.x' of https://github.com/Dai-Wenxun/mmaction2 int…
Dai-Wenxun 286499e
fix bug in swin-small
Dai-Wenxun e2349f8
Merge branch 'swin-3d' of https://github.com/Dai-Wenxun/mmaction2 int…
Dai-Wenxun 8dfbcb4
[Fix] fix ut for bmn and bsn (#1966)
cir7 5e347c9
[Fix] fix wrong config of warmup schedule in TIN config (#1912)
cir7 e54c99a
[Refactor] rename imagenet-pretrained-r50 => r50-in1k-pre (#1951)
cir7 c8e11ce
[CI] add coverage test on cuda device (#1930)
cir7 4bf822e
[Fix] fix ckpt and log links (#1967)
cir7 320c91b
Merge branch 'dev-1.x' of https://github.com/Dai-Wenxun/mmaction2 int…
Dai-Wenxun d4d4af2
update readme for k700 results
Dai-Wenxun 7d79177
fix name of norm
Dai-Wenxun 8513f0f
update readme
Dai-Wenxun 97eee72
modify docstring
Dai-Wenxun 5043905
add code for ut
Dai-Wenxun fb23b28
set x86_64 as required args
Dai-Wenxun 91df018
merge
Dai-Wenxun d314416
Merge branch 'swin-3d' of https://github.com/Dai-Wenxun/mmaction2 int…
Dai-Wenxun bb379a3
add ut
Dai-Wenxun ca75881
modify file_client_args
Dai-Wenxun 2b6a5fe
fix ut
Dai-Wenxun cde79c8
add ut
Dai-Wenxun f7eaeb9
fix lint
Dai-Wenxun f6838df
fix lint
Dai-Wenxun bfa9d56
fix lint (#1971)
Dai-Wenxun d22e781
Merge branch 'dev-1.x' of https://github.com/open-mmlab/mmaction2 int…
Dai-Wenxun 9ef73bf
Merge branch 'dev-1.x' into swin-3d
Dai-Wenxun f82e9de
fix lint
Dai-Wenxun 2a19ef7
update params in README
Dai-Wenxun 974a751
[Feature] Support Video Swin Transformer (#1939)
Dai-Wenxun 1e3d1de
[CI] fix timm related bug (#1976)
cir7 c131581
add metafile
hukkai 14bf9b4
Merge branch 'dev-1.x' into video-mae
hukkai 1824640
Merge branch 'dev-1.x' of https://github.com/Dai-Wenxun/mmaction2 int…
Dai-Wenxun 8fc6f9d
[Fix] update mmengine version restriction (#1987)
hukkai fdf6672
add colab tutorial (#1956)
hukkai 991811d
fix k700 (#1986)
hukkai 0620de7
Merge branch 'dev-1.x' of https://github.com/Dai-Wenxun/mmaction2 int…
Dai-Wenxun 29b9ea1
Merge branch 'dev-1.x' of https://github.com/open-mmlab/mmaction2 int…
Dai-Wenxun 68021bd
Merge branch 'video-mae' of https://github.com/hukkai/mmaction2 into …
Dai-Wenxun beb580d
Delete videomae-pretrained-vit-base_16x4x1_kinetics-400.py
hukkai a19bd3f
Delete videomae-pretrained-vit-large_16x4x1_kinetics-400.py
hukkai d454f90
fix sampleframe in test mode
hukkai 519c337
fix sampleframe in test mode
hukkai c20dfa6
add flops
hukkai 75249a5
add flops
hukkai a8cd2a5
add flops
hukkai efa7372
add flops
hukkai b0f8beb
fix sample frames
hukkai 89061e3
fix sample frames
hukkai f3f0bb6
fix sample frames
hukkai 33d8290
fix
Dai-Wenxun e574430
Merge remote-tracking branch 'upstream/dev-1.x' into video-mae
Dai-Wenxun 50c6a7b
Merge branch 'dev-1.x' into video-mae
Dai-Wenxun

Files changed
configs/recognition/videomae/README.md (new file, 63 lines)
# VideoMAE

[VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training](https://arxiv.org/abs/2203.12602)

<!-- [ALGORITHM] -->

## Abstract

<!-- [ABSTRACT] -->

Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets. In this paper, we show that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). We are inspired by the recent ImageMAE and propose customized video tube masking with an extremely high ratio. This simple design makes video reconstruction a more challenging self-supervision task, thus encouraging extracting more effective video representations during this pre-training process. We obtain three important findings on SSVP: (1) An extremely high proportion of masking ratio (i.e., 90% to 95%) still yields favorable performance of VideoMAE. The temporally redundant video content enables a higher masking ratio than that of images. (2) VideoMAE achieves impressive results on very small datasets (i.e., around 3k-4k videos) without using any extra data. (3) VideoMAE shows that data quality is more important than data quantity for SSVP. Domain shift between pre-training and target datasets is an important issue. Notably, our VideoMAE with the vanilla ViT can achieve 85.8% on Kinetics-400, 75.3% on Something-Something V2, 90.8% on UCF101, and 61.1% on HMDB51, without using any extra data.

<!-- [IMAGE] -->

<div align=center>
<img src="https://user-images.githubusercontent.com/35267818/191656296-14f28f4a-203f-4eeb-a4c3-c2efdb6d1ab4.png" width="800"/>
</div>
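The core design described in the abstract is tube masking: one random set of spatial patches is masked in every frame, so the temporally redundant content cannot simply be copied from neighboring frames. Below is a minimal illustrative sketch of the idea, not the authors' implementation (the real VideoMAE additionally groups frames into two-frame tubelets before masking):

```python
import torch

def tube_mask(num_frames: int, num_patches: int, mask_ratio: float = 0.9) -> torch.Tensor:
    """Illustrative tube masking: the same spatial patches are masked in
    every frame, so masked content cannot be recovered from temporal
    neighbors. Returns a (num_frames, num_patches) boolean mask where
    True marks a masked patch."""
    num_masked = int(num_patches * mask_ratio)
    # Choose which spatial patches to mask, once for the whole clip.
    perm = torch.randperm(num_patches)
    spatial_mask = torch.zeros(num_patches, dtype=torch.bool)
    spatial_mask[perm[:num_masked]] = True
    # Repeat the same spatial mask along the temporal axis ("tubes").
    return spatial_mask.unsqueeze(0).expand(num_frames, -1)

# 16 frames, a 14x14 patch grid, 90% masking as in the paper.
mask = tube_mask(num_frames=16, num_patches=14 * 14, mask_ratio=0.9)
print(mask.shape, mask.float().mean())  # torch.Size([16, 196]), ~0.90 masked
```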
## Results and Models

### Kinetics-400

| frame sampling strategy | resolution | backbone | top1 acc | top5 acc | reference top1 acc | reference top5 acc | testing protocol | FLOPs | params | config | ckpt |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 16x4x1 | short-side 320 | ViT-B | 81.3 | 95.0 | 81.5 \[[VideoMAE](https://github.com/MCG-NJU/VideoMAE/blob/main/MODEL_ZOO.md)\] | 95.1 \[[VideoMAE](https://github.com/MCG-NJU/VideoMAE/blob/main/MODEL_ZOO.md)\] | 5 clips x 3 crops | 180G | 87M | [config](/configs/recognition/videomae/vit-base-p16_videomae-k400-pre_16x4x1_kinetics-400.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/videomae/vit-base-p16_videomae-k400-pre_16x4x1_kinetics-400_20221013-860a3cd3.pth) \[1\] |
| 16x4x1 | short-side 320 | ViT-L | 85.3 | 96.7 | 85.2 \[[VideoMAE](https://github.com/MCG-NJU/VideoMAE/blob/main/MODEL_ZOO.md)\] | 96.8 \[[VideoMAE](https://github.com/MCG-NJU/VideoMAE/blob/main/MODEL_ZOO.md)\] | 5 clips x 3 crops | 597G | 305M | [config](/configs/recognition/videomae/vit-large-p16_videomae-k400-pre_16x4x1_kinetics-400.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/videomae/vit-large-p16_videomae-k400-pre_16x4x1_kinetics-400_20221013-229dbb03.pth) \[1\] |
\[1\] The models are ported from the repo [VideoMAE](https://github.com/MCG-NJU/VideoMAE) and tested on our data. Currently we only support testing of VideoMAE models; training will be available soon.

1. The values in the columns named "reference" are the results reported in the original repo.
2. The validation set of Kinetics-400 we used consists of 19796 videos. These videos are available at [Kinetics400-Validation](https://mycuhk-my.sharepoint.com/:u:/g/personal/1155136485_link_cuhk_edu_hk/EbXw2WX94J1Hunyt3MWNDJUBz-nHvQYhO9pvKqm6g39PMA?e=a9QldB). The corresponding [data list](https://download.openmmlab.com/mmaction/dataset/k400_val/kinetics_val_list.txt) (each line is of the format 'video_id, num_frames, label_index') and the [label map](https://download.openmmlab.com/mmaction/dataset/k400_val/kinetics_class2ind.txt) are also available.

For more details on data preparation, you can refer to [preparing_kinetics](/tools/data/kinetics/README.md).
## Test

You can use the following command to test a model:

```shell
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
```

Example: test the ViT-base model on the Kinetics-400 dataset and dump the result to a pkl file.

```shell
python tools/test.py configs/recognition/videomae/vit-base-p16_videomae-k400-pre_16x4x1_kinetics-400.py \
    checkpoints/SOME_CHECKPOINT.pth --dump result.pkl
```

For more details, you can refer to the **Test** part in the [Training and Test Tutorial](/docs/en/user_guides/4_train_test.md).
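For quick programmatic testing on a single video, the high-level inference API can be used as well. A minimal sketch, assuming an MMAction2 1.x installation and a locally downloaded checkpoint (the checkpoint path is a placeholder, and `demo/demo.mp4` is the sample video shipped with the repo):

```python
from mmaction.apis import init_recognizer, inference_recognizer

config = 'configs/recognition/videomae/vit-base-p16_videomae-k400-pre_16x4x1_kinetics-400.py'
checkpoint = 'checkpoints/SOME_CHECKPOINT.pth'  # placeholder: download the ckpt linked above first

# Build the recognizer and load the ported VideoMAE weights.
model = init_recognizer(config, checkpoint, device='cuda:0')  # or device='cpu'

# Run the test pipeline on one video; the returned data sample
# carries the predicted class scores.
result = inference_recognizer(model, 'demo/demo.mp4')
print(result)
```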
## Citation

```BibTeX
@misc{tong2022videomae,
      title={VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
      author={Zhan Tong and Yibing Song and Jue Wang and Limin Wang},
      year={2022},
      eprint={2203.12602},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```
configs/recognition/videomae/metafile.yml (new file, 43 lines)

```yaml
Collections:
  - Name: VideoMAE
    README: configs/recognition/videomae/README.md
    Paper:
      URL: https://arxiv.org/abs/2203.12602
      Title: "VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training"

Models:
  - Name: vit-base-p16_videomae-k400-pre_16x4x1_kinetics-400
    Config: configs/recognition/videomae/vit-base-p16_videomae-k400-pre_16x4x1_kinetics-400.py
    In Collection: VideoMAE
    Metadata:
      Architecture: ViT-B
      Resolution: short-side 320
      Modality: RGB
      Converted From:
        Weights: https://github.com/MCG-NJU/VideoMAE/blob/main/MODEL_ZOO.md
        Code: https://github.com/MCG-NJU/VideoMAE/
    Results:
      - Dataset: Kinetics-400
        Task: Action Recognition
        Metrics:
          Top 1 Accuracy: 81.3
          Top 5 Accuracy: 95.0
    Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/videomae/vit-base-p16_videomae-k400-pre_16x4x1_kinetics-400_20221013-860a3cd3.pth

  - Name: vit-large-p16_videomae-k400-pre_16x4x1_kinetics-400
    Config: configs/recognition/videomae/vit-large-p16_videomae-k400-pre_16x4x1_kinetics-400.py
    In Collection: VideoMAE
    Metadata:
      Architecture: ViT-L
      Resolution: short-side 320
      Modality: RGB
      Converted From:
        Weights: https://github.com/MCG-NJU/VideoMAE/blob/main/MODEL_ZOO.md
        Code: https://github.com/MCG-NJU/VideoMAE/
    Results:
      - Dataset: Kinetics-400
        Task: Action Recognition
        Metrics:
          Top 1 Accuracy: 85.3
          Top 5 Accuracy: 96.7
    Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/videomae/vit-large-p16_videomae-k400-pre_16x4x1_kinetics-400_20221013-229dbb03.pth
```
configs/recognition/videomae/vit-base-p16_videomae-k400-pre_16x4x1_kinetics-400.py (new file, 61 lines)

```python
_base_ = ['../../_base_/default_runtime.py']

# model settings
model = dict(
    type='Recognizer3D',
    backbone=dict(
        type='VisionTransformer',
        img_size=224,
        patch_size=16,
        embed_dims=768,
        depth=12,
        num_heads=12,
        mlp_ratio=4,
        qkv_bias=True,
        num_frames=16,
        norm_cfg=dict(type='LN', eps=1e-6)),
    cls_head=dict(
        type='TimeSformerHead',
        num_classes=400,
        in_channels=768,
        average_clips='prob'),
    data_preprocessor=dict(
        type='ActionDataPreprocessor',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        format_shape='NCTHW'))

# dataset settings
dataset_type = 'VideoDataset'
data_root_val = 'data/kinetics400/videos_val'
ann_file_test = 'data/kinetics400/kinetics400_val_list_videos.txt'

test_pipeline = [
    dict(type='DecordInit'),
    dict(
        type='SampleFrames',
        clip_len=16,       # 16 frames per clip,
        frame_interval=4,  # sampled every 4 frames,
        num_clips=5,       # 5 temporal clips at test time (the "16x4x1" strategy)
        test_mode=True),
    dict(type='DecordDecode'),
    dict(type='Resize', scale=(-1, 224)),
    dict(type='ThreeCrop', crop_size=224),  # 3 spatial crops -> 5 clips x 3 crops = 15 views
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='PackActionInputs')
]

test_dataloader = dict(
    batch_size=1,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type=dataset_type,
        ann_file=ann_file_test,
        data_prefix=dict(video=data_root_val),
        pipeline=test_pipeline,
        test_mode=True))

test_evaluator = dict(type='AccMetric')
test_cfg = dict(type='TestLoop')
```
configs/recognition/videomae/vit-large-p16_videomae-k400-pre_16x4x1_kinetics-400.py (new file, 6 lines)

```python
_base_ = ['vit-base-p16_videomae-k400-pre_16x4x1_kinetics-400.py']

# model settings: only the ViT-L sizes differ from the ViT-B base config
model = dict(
    backbone=dict(embed_dims=1024, depth=24, num_heads=16),
    cls_head=dict(in_channels=1024))
```
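Because the ViT-L config only overrides the model sizes via `_base_` inheritance, one way to sanity-check the merged result is to load it with MMEngine's config loader. A small sketch, assuming mmengine is installed and the config paths above exist locally:

```python
from mmengine.config import Config

cfg = Config.fromfile(
    'configs/recognition/videomae/vit-large-p16_videomae-k400-pre_16x4x1_kinetics-400.py')

# Fields set in the ViT-L file override the ViT-B base; everything else is inherited.
print(cfg.model.backbone.embed_dims)  # 1024 (base config: 768)
print(cfg.model.backbone.depth)       # 24   (base config: 12)
print(cfg.test_pipeline[1])           # SampleFrames settings, inherited unchanged
```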
Review comment: These descriptions are no longer needed?