[Feature] Support VideoMAE #1942

Merged · 130 commits · Oct 25, 2022

Commits
92c1596
[Refactor] update slowfast (#1880)
Dai-Wenxun Sep 1, 2022
b5f248a
[Fix] fix readme (#1882)
Dai-Wenxun Sep 1, 2022
0e7e57c
[Refactor] rename posec3d related config files (#1883)
cir7 Sep 1, 2022
7eb01b8
[Refactor] rename 2s-agcn related config files (#1879)
cir7 Sep 1, 2022
a3aefb9
[Refactor] update timesformer (#1884)
Dai-Wenxun Sep 1, 2022
dd032a2
[Refactor] update x3d configs (#1892)
Dai-Wenxun Sep 2, 2022
cf6e570
fix typo (#1895)
Dai-Wenxun Sep 2, 2022
023579f
update trn (#1888)
Dai-Wenxun Sep 2, 2022
9aa8171
[Fix] Fix configs and README for SlowOnly (#1889)
hukkai Sep 2, 2022
3bf2890
[Fix] Fix README for TSM (#1887)
hukkai Sep 2, 2022
ce70e23
[Fix] fix config download links (#1890)
cir7 Sep 2, 2022
361d331
[Fix] fix audio readme (#1898)
Dai-Wenxun Sep 2, 2022
af5495b
[Refactor] rename stgcn realted config files (#1891)
cir7 Sep 5, 2022
9580d95
fix posec3d readme and metafile (#1899)
Dai-Wenxun Sep 5, 2022
6c7cbe5
[Fix] fix typo in TIN config and slurm_train (#1904)
cir7 Sep 5, 2022
b003f93
[Fix] Fix configs for detection (#1903)
hukkai Sep 8, 2022
de658ce
[Fix] Fix configs for TSN (#1905)
hukkai Sep 8, 2022
a15b934
[CI] Fix CI for dev-1.x (#1923)
cir7 Sep 13, 2022
2485566
[Demo] Support Skeleton demo (#1920)
Dai-Wenxun Sep 16, 2022
9f833de
[Fix] Fix urls in trn and i3d (#1925)
Dai-Wenxun Sep 16, 2022
8da0d01
[Refactor] update 2sagcn readme (#1915)
Dai-Wenxun Sep 19, 2022
439bfce
[Refactor] update tanet readme (#1916)
Dai-Wenxun Sep 19, 2022
a7bafb5
first commit
Dai-Wenxun Sep 20, 2022
319eaf0
fix lint
Dai-Wenxun Sep 20, 2022
9f2ed25
fix constructor
Dai-Wenxun Sep 20, 2022
6441419
fix lint
Dai-Wenxun Sep 20, 2022
ee906df
remove timm dependencies
Dai-Wenxun Sep 20, 2022
d8103fb
fix lint
Dai-Wenxun Sep 20, 2022
7e39095
replace type hint: Tensor->torch.Tensor
Dai-Wenxun Sep 20, 2022
32a9a3d
fix lint
Dai-Wenxun Sep 20, 2022
2cc2245
support video mae
hukkai Sep 21, 2022
0a5872f
support video mae
hukkai Sep 21, 2022
19f288c
support video mae
hukkai Sep 21, 2022
3ac55f0
support video mae
hukkai Sep 21, 2022
5081578
[Fix] fix a bug in UT (#1937)
hukkai Sep 21, 2022
2a106cf
support video mae
hukkai Sep 22, 2022
5ab247e
support video mae
hukkai Sep 22, 2022
a55f465
support video mae
hukkai Sep 22, 2022
7aa069e
support video mae
hukkai Sep 22, 2022
ca4502f
support video mae
hukkai Sep 22, 2022
40dcbea
support video mae
hukkai Sep 22, 2022
da02f31
support video mae
hukkai Sep 22, 2022
91f5f29
support video mae
hukkai Sep 22, 2022
c91cd4b
add auto_scale_lr and file_client_args
Dai-Wenxun Sep 22, 2022
59386c9
add swin-l k700
Dai-Wenxun Sep 22, 2022
25d2491
fix lint
Dai-Wenxun Sep 22, 2022
b0e2b73
fix bug
Dai-Wenxun Sep 22, 2022
586f400
fix lint
Dai-Wenxun Sep 22, 2022
49d08d2
support video mae
hukkai Sep 22, 2022
f6dc205
support video mae
hukkai Sep 22, 2022
afe7aa9
[Fix] fix a bug in UT (#1937) (#2)
hukkai Sep 22, 2022
31bc669
support video mae
hukkai Sep 22, 2022
0b036e5
support video mae
hukkai Sep 22, 2022
415d654
support video mae
hukkai Sep 22, 2022
901be04
support video mae
hukkai Sep 22, 2022
0237f47
support video mae
hukkai Sep 22, 2022
eae1bbb
Update vit_mae-pretrained-vit-base_16x4x1_kinetics-400.py
hukkai Sep 22, 2022
6f8718a
[Refactor] update TPN readme (#1927)
cir7 Sep 23, 2022
b292feb
[Refactor] remove onnx related tools (#1928)
cir7 Sep 23, 2022
24e0eab
[Doc] update migration doc (#1931)
cir7 Sep 23, 2022
d83303e
[Doc] fix link in data_prepare.md (#1944)
cir7 Sep 23, 2022
23ce7af
support video mae
hukkai Sep 23, 2022
bafcabe
update
hukkai Sep 26, 2022
1adff3c
[Fix] fix BSN and BMN configs for localization (#1913)
hukkai Sep 27, 2022
e6045ed
modify stgcn-ntu60 (#1914)
Dai-Wenxun Sep 27, 2022
88e6946
[Refactor] update TIN readme (#1926)
cir7 Sep 27, 2022
f4ac064
fix review
hukkai Sep 28, 2022
0a5342f
fix review
hukkai Sep 28, 2022
5cc687e
optimize backbone
Dai-Wenxun Sep 29, 2022
7451bd3
optimize backbone
Dai-Wenxun Sep 29, 2022
7b469cc
modify configs
Dai-Wenxun Sep 30, 2022
fd990b1
fix urls
Dai-Wenxun Sep 30, 2022
dddb2e3
fix lint
Dai-Wenxun Sep 30, 2022
6c7f131
fix bug
Dai-Wenxun Sep 30, 2022
fc8ceea
fix bugs in cls_head
Dai-Wenxun Sep 30, 2022
7cd5f0c
Dev 1.x (#4)
hukkai Sep 30, 2022
3c63b72
rebase
hukkai Sep 30, 2022
69f31b7
Merge branch 'dev-1.x' of https://github.com/Dai-Wenxun/mmaction2 int…
Dai-Wenxun Sep 30, 2022
286499e
fix bug in swin-small
Dai-Wenxun Sep 30, 2022
e2349f8
Merge branch 'swin-3d' of https://github.com/Dai-Wenxun/mmaction2 int…
Dai-Wenxun Sep 30, 2022
8dfbcb4
[Fix] fix ut for bmn and bsn (#1966)
cir7 Sep 30, 2022
5e347c9
[Fix] fix wrong config of warmup schedule in TIN config (#1912)
cir7 Sep 30, 2022
e54c99a
[Refactor] rename imagenet-pretrained-r50 => r50-in1k-pre (#1951)
cir7 Sep 30, 2022
c8e11ce
[CI] add coverage test on cuda device (#1930)
cir7 Sep 30, 2022
4bf822e
[Fix] fix ckpt and log links (#1967)
cir7 Sep 30, 2022
320c91b
Merge branch 'dev-1.x' of https://github.com/Dai-Wenxun/mmaction2 int…
Dai-Wenxun Oct 1, 2022
d4d4af2
update readme for k700 results
Dai-Wenxun Oct 1, 2022
7d79177
fix name of norm
Dai-Wenxun Oct 2, 2022
8513f0f
update readme
Dai-Wenxun Oct 2, 2022
97eee72
modify docstring
Dai-Wenxun Oct 3, 2022
5043905
add code for ut
Dai-Wenxun Oct 3, 2022
fb23b28
set x86_64 as required args
Dai-Wenxun Oct 3, 2022
91df018
merge
Dai-Wenxun Oct 3, 2022
d314416
Merge branch 'swin-3d' of https://github.com/Dai-Wenxun/mmaction2 int…
Dai-Wenxun Oct 3, 2022
bb379a3
add ut
Dai-Wenxun Oct 3, 2022
ca75881
modify file_client_args
Dai-Wenxun Oct 7, 2022
2b6a5fe
fix ut
Dai-Wenxun Oct 7, 2022
cde79c8
add ut
Dai-Wenxun Oct 7, 2022
f7eaeb9
fix lint
Dai-Wenxun Oct 7, 2022
f6838df
fix lint
Dai-Wenxun Oct 7, 2022
bfa9d56
fix lint (#1971)
Dai-Wenxun Oct 8, 2022
d22e781
Merge branch 'dev-1.x' of https://github.com/open-mmlab/mmaction2 int…
Dai-Wenxun Oct 8, 2022
9ef73bf
Merge branch 'dev-1.x' into swin-3d
Dai-Wenxun Oct 8, 2022
f82e9de
fix lint
Dai-Wenxun Oct 8, 2022
2a19ef7
update params in README
Dai-Wenxun Oct 9, 2022
974a751
[Feature] Support Video Swin Transfomer (#1939)
Dai-Wenxun Oct 11, 2022
1e3d1de
[CI] fix timm related bug (#1976)
cir7 Oct 11, 2022
c131581
add metafile
hukkai Oct 13, 2022
14bf9b4
Merge branch 'dev-1.x' into video-mae
hukkai Oct 13, 2022
1824640
Merge branch 'dev-1.x' of https://github.com/Dai-Wenxun/mmaction2 int…
Dai-Wenxun Oct 14, 2022
8fc6f9d
[Fix] update mmengine version restriction (#1987)
hukkai Oct 14, 2022
fdf6672
add colab tutorial (#1956)
hukkai Oct 14, 2022
991811d
fix k700 (#1986)
hukkai Oct 14, 2022
0620de7
Merge branch 'dev-1.x' of https://github.com/Dai-Wenxun/mmaction2 int…
Dai-Wenxun Oct 14, 2022
29b9ea1
Merge branch 'dev-1.x' of https://github.com/open-mmlab/mmaction2 int…
Dai-Wenxun Oct 14, 2022
68021bd
Merge branch 'video-mae' of https://github.com/hukkai/mmaction2 into …
Dai-Wenxun Oct 17, 2022
beb580d
Delete videomae-pretrained-vit-base_16x4x1_kinetics-400.py
hukkai Oct 17, 2022
a19bd3f
Delete videomae-pretrained-vit-large_16x4x1_kinetics-400.py
hukkai Oct 17, 2022
d454f90
fix sampleframe in test mode
hukkai Oct 20, 2022
519c337
fix sampleframe in test mode
hukkai Oct 20, 2022
c20dfa6
add flops
hukkai Oct 21, 2022
75249a5
add flops
hukkai Oct 21, 2022
a8cd2a5
add flops
hukkai Oct 21, 2022
efa7372
add flops
hukkai Oct 21, 2022
b0f8beb
fix sample frames
hukkai Oct 21, 2022
89061e3
fix sample frames
hukkai Oct 21, 2022
f3f0bb6
fix sample frames
hukkai Oct 21, 2022
33d8290
fix
Dai-Wenxun Oct 25, 2022
e574430
Merge remote-tracking branch 'upstream/dev-1.x' into video-mae
Dai-Wenxun Oct 25, 2022
50c6a7b
Merge branch 'dev-1.x' into video-mae
Dai-Wenxun Oct 25, 2022
63 changes: 63 additions & 0 deletions configs/recognition/videomae/README.md
@@ -0,0 +1,63 @@
# VideoMAE

[VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training](https://arxiv.org/abs/2203.12602)

<!-- [ALGORITHM] -->

## Abstract

<!-- [ABSTRACT] -->

Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets. In this paper, we show that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). We are inspired by the recent ImageMAE and propose customized video tube masking with an extremely high ratio. This simple design makes video reconstruction a more challenging self-supervision task, thus encouraging extracting more effective video representations during this pre-training process. We obtain three important findings on SSVP: (1) An extremely high proportion of masking ratio (i.e., 90% to 95%) still yields favorable performance of VideoMAE. The temporally redundant video content enables a higher masking ratio than that of images. (2) VideoMAE achieves impressive results on very small datasets (i.e., around 3k-4k videos) without using any extra data. (3) VideoMAE shows that data quality is more important than data quantity for SSVP. Domain shift between pre-training and target datasets is an important issue. Notably, our VideoMAE with the vanilla ViT can achieve 85.8% on Kinetics-400, 75.3% on Something-Something V2, 90.8% on UCF101, and 61.1% on HMDB51, without using any extra data.

<!-- [IMAGE] -->

<div align=center>
<img src="https://user-images.githubusercontent.com/35267818/191656296-14f28f4a-203f-4eeb-a4c3-c2efdb6d1ab4.png" width="800"/>
</div>

## Results and Models

### Kinetics-400

| frame sampling strategy | resolution | backbone | top1 acc | top5 acc | reference top1 acc | reference top5 acc | testing protocol | FLOPs | params | config | ckpt |
| :---------------------: | :------------: | :------: | :------: | :------: | :--------------------------------: | :--------------------------------: | :---------------: | :---: | :----: | :--------------------: | :-------------------: |
| 16x4x1 | short-side 320 | ViT-B | 81.3 | 95.0 | 81.5 \[[VideoMAE](https://github.com/MCG-NJU/VideoMAE/blob/main/MODEL_ZOO.md)\] | 95.1 \[[VideoMAE](https://github.com/MCG-NJU/VideoMAE/blob/main/MODEL_ZOO.md)\] | 5 clips x 3 crops | 180G | 87M | [config](/configs/recognition/videomae/vit-base-p16_videomae-k400-pre_16x4x1_kinetics-400.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/videomae/vit-base-p16_videomae-k400-pre_16x4x1_kinetics-400_20221013-860a3cd3.pth) \[1\] |
| 16x4x1 | short-side 320 | ViT-L | 85.3 | 96.7 | 85.2 \[[VideoMAE](https://github.com/MCG-NJU/VideoMAE/blob/main/MODEL_ZOO.md)\] | 96.8 \[[VideoMAE](https://github.com/MCG-NJU/VideoMAE/blob/main/MODEL_ZOO.md)\] | 5 clips x 3 crops | 597G | 305M | [config](/configs/recognition/videomae/vit-large-p16_videomae-k400-pre_16x4x1_kinetics-400.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/videomae/vit-large-p16_videomae-k400-pre_16x4x1_kinetics-400_20221013-229dbb03.pth) \[1\] |

\[1\] The models are ported from the repo [VideoMAE](https://github.com/MCG-NJU/VideoMAE) and tested on our data. Currently, we only support testing of VideoMAE models; training will be available soon.

1. The values in the columns named "reference" are the results reported by the original repo.
2. The validation set of Kinetics-400 we used consists of 19,796 videos. These videos are available at [Kinetics400-Validation](https://mycuhk-my.sharepoint.com/:u:/g/personal/1155136485_link_cuhk_edu_hk/EbXw2WX94J1Hunyt3MWNDJUBz-nHvQYhO9pvKqm6g39PMA?e=a9QldB). The corresponding [data list](https://download.openmmlab.com/mmaction/dataset/k400_val/kinetics_val_list.txt) (each line is of the format 'video_id, num_frames, label_index') and the [label map](https://download.openmmlab.com/mmaction/dataset/k400_val/kinetics_class2ind.txt) are also available.

For more details on data preparation, you can refer to [preparing_kinetics](/tools/data/kinetics/README.md).
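As a quick sanity check, here is a minimal sketch of a parser for the data list described above. It assumes whitespace-separated fields per line ('video_id num_frames label_index'); adjust the split if your copy of the list uses a different delimiter.

```python
# Minimal sketch: read a Kinetics-400 annotation list into (video_id,
# num_frames, label) tuples. Assumes whitespace-separated fields.
def load_video_list(path):
    samples = []
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            video_id, num_frames, label = line.split()
            samples.append((video_id, int(num_frames), int(label)))
    return samples

# e.g. samples = load_video_list('data/kinetics400/kinetics_val_list.txt')
```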

## Test

You can use the following command to test a model.

```shell
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
```

Example: test the ViT-base model on the Kinetics-400 dataset and dump the result to a pkl file.

```shell
python tools/test.py configs/recognition/videomae/vit-base-p16_videomae-k400-pre_16x4x1_kinetics-400.py \
checkpoints/SOME_CHECKPOINT.pth --dump result.pkl
```

For more details, you can refer to the **Test** part in the [Training and Test Tutorial](/docs/en/user_guides/4_train_test.md).
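If you want to inspect the dumped predictions afterwards, a rough sketch is below. The exact schema of `result.pkl` depends on the mmaction2 version, so treat the printed fields as something to verify rather than a documented interface.

```python
# Quick look at the dumped test results; the structure is an assumption
# (typically a list with one entry per video), not a guaranteed format.
import pickle

with open('result.pkl', 'rb') as f:
    results = pickle.load(f)

print(type(results), len(results))  # e.g. <class 'list'> 19796
print(results[0])                   # inspect the fields of the first sample
```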

## Citation

```BibTeX
@misc{tong2022videomae,
      title={VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
      author={Zhan Tong and Yibing Song and Jue Wang and Limin Wang},
      year={2022},
      eprint={2203.12602},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```
43 changes: 43 additions & 0 deletions configs/recognition/videomae/metafile.yml
@@ -0,0 +1,43 @@
Collections:
- Name: VideoMAE
README: configs/recognition/videomae/README.md
Paper:
URL: https://arxiv.org/abs/2203.12602
Title: "VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training"

Models:
- Name: vit-base-p16_videomae-k400-pre_16x4x1_kinetics-400
Config: configs/recognition/videomae/vit-base-p16_videomae-k400-pre_16x4x1_kinetics-400.py
In Collection: VideoMAE
Metadata:
Architecture: ViT-B
Resolution: short-side 320
Modality: RGB
Converted From:
Weights: https://github.com/MCG-NJU/VideoMAE/blob/main/MODEL_ZOO.md
Code: https://github.com/MCG-NJU/VideoMAE/
Results:
- Dataset: Kinetics-400
Task: Action Recognition
Metrics:
Top 1 Accuracy: 81.3
Top 5 Accuracy: 95.0
Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/videomae/vit-base-p16_videomae-k400-pre_16x4x1_kinetics-400_20221013-860a3cd3.pth

- Name: vit-large-p16_videomae-k400-pre_16x4x1_kinetics-400
Config: configs/recognition/videomae/vit-large-p16_videomae-k400-pre_16x4x1_kinetics-400.py
In Collection: VideoMAE
Metadata:
Architecture: ViT-L
Resolution: short-side 320
Modality: RGB
Converted From:
Weights: https://github.com/MCG-NJU/VideoMAE/blob/main/MODEL_ZOO.md
Code: https://github.com/MCG-NJU/VideoMAE/
Results:
- Dataset: Kinetics-400
Task: Action Recognition
Metrics:
Top 1 Accuracy: 85.3
Top 5 Accuracy: 96.7
Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/videomae/vit-large-p16_videomae-k400-pre_16x4x1_kinetics-400_20221013-229dbb03.pth
61 changes: 61 additions & 0 deletions configs/recognition/videomae/vit-base-p16_videomae-k400-pre_16x4x1_kinetics-400.py
@@ -0,0 +1,61 @@
_base_ = ['../../_base_/default_runtime.py']

# model settings
model = dict(
type='Recognizer3D',
backbone=dict(
type='VisionTransformer',
img_size=224,
patch_size=16,
embed_dims=768,
depth=12,
num_heads=12,
mlp_ratio=4,
qkv_bias=True,
num_frames=16,
norm_cfg=dict(type='LN', eps=1e-6)),
cls_head=dict(
type='TimeSformerHead',
num_classes=400,
in_channels=768,
average_clips='prob'),
data_preprocessor=dict(
type='ActionDataPreprocessor',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
format_shape='NCTHW'))

# dataset settings
dataset_type = 'VideoDataset'
data_root_val = 'data/kinetics400/videos_val'
ann_file_test = 'data/kinetics400/kinetics400_val_list_videos.txt'

test_pipeline = [
dict(type='DecordInit'),
dict(
type='SampleFrames',
clip_len=16,
frame_interval=4,
num_clips=5,
test_mode=True),
dict(type='DecordDecode'),
dict(type='Resize', scale=(-1, 224)),
dict(type='ThreeCrop', crop_size=224),
dict(type='FormatShape', input_format='NCTHW'),
dict(type='PackActionInputs')
]

test_dataloader = dict(
batch_size=1,
num_workers=8,
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=False),
dataset=dict(
type=dataset_type,
ann_file=ann_file_test,
data_prefix=dict(video=data_root_val),
pipeline=test_pipeline,
test_mode=True))

test_evaluator = dict(type='AccMetric')
test_cfg = dict(type='TestLoop')
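As a back-of-the-envelope check of the testing protocol defined in this config (illustrative arithmetic only, not part of the diff): `SampleFrames` with `num_clips=5` plus `ThreeCrop` yields 15 views per video, matching the "5 clips x 3 crops" entry in the README table.

```python
# 5 temporal clips x 3 spatial crops = 15 views per video; each view is a
# 16-frame 224x224 RGB clip laid out as NCTHW, as set by FormatShape above.
num_clips, num_crops = 5, 3
clip_len, crop_size = 16, 224
views_per_video = num_clips * num_crops
view_shape = (3, clip_len, crop_size, crop_size)  # (C, T, H, W)
print(views_per_video, view_shape)  # 15 (3, 16, 224, 224)
```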
6 changes: 6 additions & 0 deletions configs/recognition/videomae/vit-large-p16_videomae-k400-pre_16x4x1_kinetics-400.py
@@ -0,0 +1,6 @@
_base_ = ['vit-base-p16_videomae-k400-pre_16x4x1_kinetics-400.py']

# model settings
model = dict(
backbone=dict(embed_dims=1024, depth=24, num_heads=16),
cls_head=dict(in_channels=1024))
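For readers unfamiliar with the `_base_` mechanism, the snippet below is an illustrative, simplified version of how the ViT-L overrides above merge into the ViT-B base config. mmengine's actual merge logic has more rules (e.g. `_delete_` keys), so this is only the core idea.

```python
# Simplified recursive dict merge: override values win, nested dicts merge.
def merge(base, override):
    out = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = merge(out[key], value)
        else:
            out[key] = value
    return out

base_backbone = dict(type='VisionTransformer', embed_dims=768, depth=12,
                     num_heads=12, num_frames=16)
vit_large = merge(base_backbone, dict(embed_dims=1024, depth=24, num_heads=16))
print(vit_large)
# {'type': 'VisionTransformer', 'embed_dims': 1024, 'depth': 24,
#  'num_heads': 16, 'num_frames': 16}
```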
23 changes: 12 additions & 11 deletions mmaction/datasets/transforms/loading.py
@@ -174,9 +174,7 @@ def _get_train_clips(self, num_frames):
     def _get_test_clips(self, num_frames):
         """Get clip offsets in test mode.

-        Calculate the average interval for selected frames, and shift them
-        fixedly by avg_interval/2. If set twice_sample True, it will sample
-        frames together without fixed shift. If the total number of frames is
+        If the total number of frames is
         not enough, it will return all zero indices.

[Review comment, Member] These descriptions are no longer needed?

         Args:
@@ -185,15 +183,18 @@ def _get_test_clips(self, num_frames):
         Returns:
             np.ndarray: Sampled frame indices in test mode.
         """
-        ori_clip_len = self.clip_len * self.frame_interval
-        avg_interval = (num_frames - ori_clip_len + 1) / float(self.num_clips)
-        if num_frames > ori_clip_len - 1:
-            base_offsets = np.arange(self.num_clips) * avg_interval
-            clip_offsets = (base_offsets + avg_interval / 2.0).astype(np.int32)
-            if self.twice_sample:
-                clip_offsets = np.concatenate([clip_offsets, base_offsets])
+        k = 2 if self.twice_sample else 1
+        num_clips = self.num_clips * k
+        ori_clip_len = (self.clip_len - 1) * self.frame_interval + 1
+        max_offset = max(num_frames - ori_clip_len, 0)
+
+        if num_clips > 1:
+            num_segments = num_clips - 1
+            offset_between = max_offset / float(num_segments)
+            clip_offsets = np.arange(num_clips) * offset_between
+            clip_offsets = np.round(clip_offsets).astype(np.int32)
         else:
-            clip_offsets = np.zeros((self.num_clips, ), dtype=np.int32)
+            clip_offsets = np.array([max_offset // 2])
         return clip_offsets

     def _sample_clips(self, num_frames):
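To see what the new test-mode sampling does in practice, here is a standalone sketch mirroring the logic above (the real implementation lives in `SampleFrames._get_test_clips`; the numbers below are illustrative).

```python
import numpy as np

def get_test_clip_offsets(num_frames, clip_len, frame_interval, num_clips,
                          twice_sample=False):
    """Evenly spread clip start offsets over the usable frame range."""
    total_clips = num_clips * (2 if twice_sample else 1)
    ori_clip_len = (clip_len - 1) * frame_interval + 1
    max_offset = max(num_frames - ori_clip_len, 0)
    if total_clips > 1:
        offset_between = max_offset / float(total_clips - 1)
        return np.round(np.arange(total_clips) * offset_between).astype(np.int32)
    return np.array([max_offset // 2])

# 16x4 clips, 5 test clips, on a 300-frame video:
print(get_test_clip_offsets(300, clip_len=16, frame_interval=4, num_clips=5))
# -> [  0  60 120 179 239]
```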
3 changes: 2 additions & 1 deletion mmaction/models/backbones/__init__.py
@@ -16,11 +16,12 @@
 from .swin import SwinTransformer3D
 from .tanet import TANet
 from .timesformer import TimeSformer
+from .vit_mae import VisionTransformer
 from .x3d import X3D

 __all__ = [
     'C3D', 'ResNet', 'ResNet3d', 'ResNetTSM', 'ResNet2Plus1d',
     'ResNet3dSlowFast', 'ResNet3dSlowOnly', 'ResNet3dCSN', 'ResNetTIN', 'X3D',
     'ResNet3dLayer', 'MobileNetV2TSM', 'MobileNetV2', 'TANet', 'TimeSformer',
-    'STGCN', 'AGCN', 'ResNetAudio', 'SwinTransformer3D'
+    'STGCN', 'AGCN', 'ResNetAudio', 'SwinTransformer3D', 'VisionTransformer'
 ]
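Finally, a rough usage sketch for the newly registered backbone. The constructor arguments mirror the config fields shown earlier; the forward interface (an NCTHW tensor in, pooled features out) is an assumption to verify against `mmaction/models/backbones/vit_mae.py`, not a documented contract.

```python
# Rough sketch only: build the ViT-B VideoMAE backbone directly and run a
# dummy clip through it. Argument names follow the config above; verify the
# actual signature and output shape in vit_mae.py before relying on this.
import torch
from mmaction.models.backbones import VisionTransformer

backbone = VisionTransformer(
    img_size=224, patch_size=16, embed_dims=768, depth=12, num_heads=12,
    mlp_ratio=4, qkv_bias=True, num_frames=16,
    norm_cfg=dict(type='LN', eps=1e-6))
backbone.eval()

with torch.no_grad():
    feats = backbone(torch.randn(1, 3, 16, 224, 224))  # (N, C, T, H, W)
print(feats.shape)  # expected to be compatible with the 768-d TimeSformerHead
```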