[Feature] Support VideoMAE #1942
Merged
130 commits
[Refactor] update slowfast (#1880)
Dai-Wenxun b5f248a
[Fix] fix readme (#1882)
Dai-Wenxun 0e7e57c
[Refactor] rename posec3d related config files (#1883)
cir7 7eb01b8
[Refactor] rename 2s-agcn related config files (#1879)
cir7 a3aefb9
[Refactor] update timesformer (#1884)
Dai-Wenxun dd032a2
[Refactor] update x3d configs (#1892)
Dai-Wenxun cf6e570
fix typo (#1895)
Dai-Wenxun 023579f
update trn (#1888)
Dai-Wenxun 9aa8171
[Fix] Fix configs and README for SlowOnly (#1889)
hukkai 3bf2890
[Fix] Fix README for TSM (#1887)
hukkai ce70e23
[Fix] fix config download links (#1890)
cir7 361d331
[Fix] fix audio readme (#1898)
Dai-Wenxun af5495b
[Refactor] rename stgcn related config files (#1891)
cir7 9580d95
fix posec3d readme and metafile (#1899)
Dai-Wenxun 6c7cbe5
[Fix] fix typo in TIN config and slurm_train (#1904)
cir7 b003f93
[Fix] Fix configs for detection (#1903)
hukkai de658ce
[Fix] Fix configs for TSN (#1905)
hukkai a15b934
[CI] Fix CI for dev-1.x (#1923)
cir7 2485566
[Demo] Support Skeleton demo (#1920)
Dai-Wenxun 9f833de
[Fix] Fix urls in trn and i3d (#1925)
Dai-Wenxun 8da0d01
[Refactor] update 2sagcn readme (#1915)
Dai-Wenxun 439bfce
[Refactor] update tanet readme (#1916)
Dai-Wenxun a7bafb5
first commit
Dai-Wenxun 319eaf0
fix lint
Dai-Wenxun 9f2ed25
fix constructor
Dai-Wenxun 6441419
fix lint
Dai-Wenxun ee906df
remove timm dependencies
Dai-Wenxun d8103fb
fix lint
Dai-Wenxun 7e39095
replace type hint: Tensor->torch.Tensor
Dai-Wenxun 32a9a3d
fix lint
Dai-Wenxun 2cc2245
support video mae
hukkai 0a5872f
support video mae
hukkai 19f288c
support video mae
hukkai 3ac55f0
support video mae
hukkai 5081578
[Fix] fix a bug in UT (#1937)
hukkai 2a106cf
support video mae
hukkai 5ab247e
support video mae
hukkai a55f465
support video mae
hukkai 7aa069e
support video mae
hukkai ca4502f
support video mae
hukkai 40dcbea
support video mae
hukkai da02f31
support video mae
hukkai 91f5f29
support video mae
hukkai c91cd4b
add auto_scale_lr and file_client_args
Dai-Wenxun 59386c9
add swin-l k700
Dai-Wenxun 25d2491
fix lint
Dai-Wenxun b0e2b73
fix bug
Dai-Wenxun 586f400
fix lint
Dai-Wenxun 49d08d2
support video mae
hukkai f6dc205
support video mae
hukkai afe7aa9
[Fix] fix a bug in UT (#1937) (#2)
hukkai 31bc669
support video mae
hukkai 0b036e5
support video mae
hukkai 415d654
support video mae
hukkai 901be04
support video mae
hukkai 0237f47
support video mae
hukkai eae1bbb
Update vit_mae-pretrained-vit-base_16x4x1_kinetics-400.py
hukkai 6f8718a
[Refactor] update TPN readme (#1927)
cir7 b292feb
[Refactor] remove onnx related tools (#1928)
cir7 24e0eab
[Doc] update migration doc (#1931)
cir7 d83303e
[Doc] fix link in data_prepare.md (#1944)
cir7 23ce7af
support video mae
hukkai bafcabe
update
hukkai 1adff3c
[Fix] fix BSN and BMN configs for localization (#1913)
hukkai e6045ed
modify stgcn-ntu60 (#1914)
Dai-Wenxun 88e6946
[Refactor] update TIN readme (#1926)
cir7 f4ac064
fix review
hukkai 0a5342f
fix review
hukkai 5cc687e
optimize backbone
Dai-Wenxun 7451bd3
optimize backbone
Dai-Wenxun 7b469cc
modify configs
Dai-Wenxun fd990b1
fix urls
Dai-Wenxun dddb2e3
fix lint
Dai-Wenxun 6c7f131
fix bug
Dai-Wenxun fc8ceea
fix bugs in cls_head
Dai-Wenxun 7cd5f0c
Dev 1.x (#4)
hukkai 3c63b72
rebase
hukkai 69f31b7
Merge branch 'dev-1.x' of https://github.com/Dai-Wenxun/mmaction2 int…
Dai-Wenxun 286499e
fix bug in swin-small
Dai-Wenxun e2349f8
Merge branch 'swin-3d' of https://github.com/Dai-Wenxun/mmaction2 int…
Dai-Wenxun 8dfbcb4
[Fix] fix ut for bmn and bsn (#1966)
cir7 5e347c9
[Fix] fix wrong config of warmup schedule in TIN config (#1912)
cir7 e54c99a
[Refactor] rename imagenet-pretrained-r50 => r50-in1k-pre (#1951)
cir7 c8e11ce
[CI] add coverage test on cuda device (#1930)
cir7 4bf822e
[Fix] fix ckpt and log links (#1967)
cir7 320c91b
Merge branch 'dev-1.x' of https://github.com/Dai-Wenxun/mmaction2 int…
Dai-Wenxun d4d4af2
update readme for k700 results
Dai-Wenxun 7d79177
fix name of norm
Dai-Wenxun 8513f0f
update readme
Dai-Wenxun 97eee72
modify docstring
Dai-Wenxun 5043905
add code for ut
Dai-Wenxun fb23b28
set x86_64 as required args
Dai-Wenxun 91df018
merge
Dai-Wenxun d314416
Merge branch 'swin-3d' of https://github.com/Dai-Wenxun/mmaction2 int…
Dai-Wenxun bb379a3
add ut
Dai-Wenxun ca75881
modify file_client_args
Dai-Wenxun 2b6a5fe
fix ut
Dai-Wenxun cde79c8
add ut
Dai-Wenxun f7eaeb9
fix lint
Dai-Wenxun f6838df
fix lint
Dai-Wenxun bfa9d56
fix lint (#1971)
Dai-Wenxun d22e781
Merge branch 'dev-1.x' of https://github.com/open-mmlab/mmaction2 int…
Dai-Wenxun 9ef73bf
Merge branch 'dev-1.x' into swin-3d
Dai-Wenxun f82e9de
fix lint
Dai-Wenxun 2a19ef7
update params in README
Dai-Wenxun 974a751
[Feature] Support Video Swin Transformer (#1939)
Dai-Wenxun 1e3d1de
[CI] fix timm related bug (#1976)
cir7 c131581
add metafile
hukkai 14bf9b4
Merge branch 'dev-1.x' into video-mae
hukkai 1824640
Merge branch 'dev-1.x' of https://github.com/Dai-Wenxun/mmaction2 int…
Dai-Wenxun 8fc6f9d
[Fix] update mmengine version restriction (#1987)
hukkai fdf6672
add colab tutorial (#1956)
hukkai 991811d
fix k700 (#1986)
hukkai 0620de7
Merge branch 'dev-1.x' of https://github.com/Dai-Wenxun/mmaction2 int…
Dai-Wenxun 29b9ea1
Merge branch 'dev-1.x' of https://github.com/open-mmlab/mmaction2 int…
Dai-Wenxun 68021bd
Merge branch 'video-mae' of https://github.com/hukkai/mmaction2 into …
Dai-Wenxun beb580d
Delete videomae-pretrained-vit-base_16x4x1_kinetics-400.py
hukkai a19bd3f
Delete videomae-pretrained-vit-large_16x4x1_kinetics-400.py
hukkai d454f90
fix sampleframe in test mode
hukkai 519c337
fix sampleframe in test mode
hukkai c20dfa6
add flops
hukkai 75249a5
add flops
hukkai a8cd2a5
add flops
hukkai efa7372
add flops
hukkai b0f8beb
fix sample frames
hukkai 89061e3
fix sample frames
hukkai f3f0bb6
fix sample frames
hukkai 33d8290
fix
Dai-Wenxun e574430
Merge remote-tracking branch 'upstream/dev-1.x' into video-mae
Dai-Wenxun 50c6a7b
Merge branch 'dev-1.x' into video-mae
Dai-Wenxun

Files changed
configs/recognition/videomae/README.md (new file, 63 lines)
# VideoMAE

[VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training](https://arxiv.org/abs/2203.12602)

<!-- [ALGORITHM] -->

## Abstract

<!-- [ABSTRACT] -->

Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets. In this paper, we show that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). We are inspired by the recent ImageMAE and propose customized video tube masking with an extremely high ratio. This simple design makes video reconstruction a more challenging self-supervision task, thus encouraging extracting more effective video representations during this pre-training process. We obtain three important findings on SSVP: (1) An extremely high proportion of masking ratio (i.e., 90% to 95%) still yields favorable performance of VideoMAE. The temporally redundant video content enables a higher masking ratio than that of images. (2) VideoMAE achieves impressive results on very small datasets (i.e., around 3k-4k videos) without using any extra data. (3) VideoMAE shows that data quality is more important than data quantity for SSVP. Domain shift between pre-training and target datasets is an important issue. Notably, our VideoMAE with the vanilla ViT can achieve 85.8% on Kinetics-400, 75.3% on Something-Something V2, 90.8% on UCF101, and 61.1% on HMDB51, without using any extra data.

<!-- [IMAGE] -->

<div align=center>
<img src="https://user-images.githubusercontent.com/35267818/191656296-14f28f4a-203f-4eeb-a4c3-c2efdb6d1ab4.png" width="800"/>
</div>
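The core design described in the abstract is tube masking: one random set of spatial patches is masked in every frame, so the temporally redundant content cannot simply be copied from neighboring frames. Below is a minimal illustrative sketch of the idea, not the authors' implementation (the real VideoMAE additionally groups frames into two-frame tubelets before masking):

```python
import torch

def tube_mask(num_frames: int, num_patches: int, mask_ratio: float = 0.9) -> torch.Tensor:
    """Illustrative tube masking: the same spatial patches are masked in
    every frame, so masked content cannot be recovered from temporal
    neighbors. Returns a (num_frames, num_patches) boolean mask where
    True marks a masked patch."""
    num_masked = int(num_patches * mask_ratio)
    # Choose which spatial patches to mask, once for the whole clip.
    perm = torch.randperm(num_patches)
    spatial_mask = torch.zeros(num_patches, dtype=torch.bool)
    spatial_mask[perm[:num_masked]] = True
    # Repeat the same spatial mask along the temporal axis ("tubes").
    return spatial_mask.unsqueeze(0).expand(num_frames, -1)

# 16 frames, a 14x14 patch grid, 90% masking as in the paper.
mask = tube_mask(num_frames=16, num_patches=14 * 14, mask_ratio=0.9)
print(mask.shape, mask.float().mean())  # torch.Size([16, 196]), ~0.90 masked
```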
## Results and Models

### Kinetics-400

| frame sampling strategy | resolution | backbone | top1 acc | top5 acc | reference top1 acc | reference top5 acc | testing protocol | FLOPs | params | config | ckpt |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 16x4x1 | short-side 320 | ViT-B | 81.3 | 95.0 | 81.5 \[[VideoMAE](https://github.com/MCG-NJU/VideoMAE/blob/main/MODEL_ZOO.md)\] | 95.1 \[[VideoMAE](https://github.com/MCG-NJU/VideoMAE/blob/main/MODEL_ZOO.md)\] | 5 clips x 3 crops | 180G | 87M | [config](/configs/recognition/videomae/vit-base-p16_videomae-k400-pre_16x4x1_kinetics-400.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/videomae/vit-base-p16_videomae-k400-pre_16x4x1_kinetics-400_20221013-860a3cd3.pth) \[1\] |
| 16x4x1 | short-side 320 | ViT-L | 85.3 | 96.7 | 85.2 \[[VideoMAE](https://github.com/MCG-NJU/VideoMAE/blob/main/MODEL_ZOO.md)\] | 96.8 \[[VideoMAE](https://github.com/MCG-NJU/VideoMAE/blob/main/MODEL_ZOO.md)\] | 5 clips x 3 crops | 597G | 305M | [config](/configs/recognition/videomae/vit-large-p16_videomae-k400-pre_16x4x1_kinetics-400.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/videomae/vit-large-p16_videomae-k400-pre_16x4x1_kinetics-400_20221013-229dbb03.pth) \[1\] |
\[1\] The models are ported from the repo [VideoMAE](https://github.com/MCG-NJU/VideoMAE) and tested on our data. Currently we only support testing of VideoMAE models; training will be available soon.

1. The values in the columns named "reference" are the results reported in the original repo.
2. The validation set of Kinetics-400 we used consists of 19796 videos. These videos are available at [Kinetics400-Validation](https://mycuhk-my.sharepoint.com/:u:/g/personal/1155136485_link_cuhk_edu_hk/EbXw2WX94J1Hunyt3MWNDJUBz-nHvQYhO9pvKqm6g39PMA?e=a9QldB). The corresponding [data list](https://download.openmmlab.com/mmaction/dataset/k400_val/kinetics_val_list.txt) (each line is of the format 'video_id, num_frames, label_index') and the [label map](https://download.openmmlab.com/mmaction/dataset/k400_val/kinetics_class2ind.txt) are also available.

For more details on data preparation, you can refer to [preparing_kinetics](/tools/data/kinetics/README.md).
## Test

You can use the following command to test a model:

```shell
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
```

Example: test the ViT-base model on the Kinetics-400 dataset and dump the result to a pkl file.

```shell
python tools/test.py configs/recognition/videomae/vit-base-p16_videomae-k400-pre_16x4x1_kinetics-400.py \
    checkpoints/SOME_CHECKPOINT.pth --dump result.pkl
```

For more details, you can refer to the **Test** part in the [Training and Test Tutorial](/docs/en/user_guides/4_train_test.md).
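For quick programmatic testing on a single video, the high-level inference API can be used as well. A minimal sketch, assuming an MMAction2 1.x installation and a locally downloaded checkpoint (the checkpoint path is a placeholder, and `demo/demo.mp4` is the sample video shipped with the repo):

```python
from mmaction.apis import init_recognizer, inference_recognizer

config = 'configs/recognition/videomae/vit-base-p16_videomae-k400-pre_16x4x1_kinetics-400.py'
checkpoint = 'checkpoints/SOME_CHECKPOINT.pth'  # placeholder: download the ckpt linked above first

# Build the recognizer and load the ported VideoMAE weights.
model = init_recognizer(config, checkpoint, device='cuda:0')  # or device='cpu'

# Run the test pipeline on one video; the returned data sample
# carries the predicted class scores.
result = inference_recognizer(model, 'demo/demo.mp4')
print(result)
```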
## Citation

```BibTeX
@misc{tong2022videomae,
      title={VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
      author={Zhan Tong and Yibing Song and Jue Wang and Limin Wang},
      year={2022},
      eprint={2203.12602},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```
configs/recognition/videomae/metafile.yml (new file, 43 lines)

```yaml
Collections:
  - Name: VideoMAE
    README: configs/recognition/videomae/README.md
    Paper:
      URL: https://arxiv.org/abs/2203.12602
      Title: "VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training"

Models:
  - Name: vit-base-p16_videomae-k400-pre_16x4x1_kinetics-400
    Config: configs/recognition/videomae/vit-base-p16_videomae-k400-pre_16x4x1_kinetics-400.py
    In Collection: VideoMAE
    Metadata:
      Architecture: ViT-B
      Resolution: short-side 320
      Modality: RGB
      Converted From:
        Weights: https://github.com/MCG-NJU/VideoMAE/blob/main/MODEL_ZOO.md
        Code: https://github.com/MCG-NJU/VideoMAE/
    Results:
      - Dataset: Kinetics-400
        Task: Action Recognition
        Metrics:
          Top 1 Accuracy: 81.3
          Top 5 Accuracy: 95.0
    Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/videomae/vit-base-p16_videomae-k400-pre_16x4x1_kinetics-400_20221013-860a3cd3.pth

  - Name: vit-large-p16_videomae-k400-pre_16x4x1_kinetics-400
    Config: configs/recognition/videomae/vit-large-p16_videomae-k400-pre_16x4x1_kinetics-400.py
    In Collection: VideoMAE
    Metadata:
      Architecture: ViT-L
      Resolution: short-side 320
      Modality: RGB
      Converted From:
        Weights: https://github.com/MCG-NJU/VideoMAE/blob/main/MODEL_ZOO.md
        Code: https://github.com/MCG-NJU/VideoMAE/
    Results:
      - Dataset: Kinetics-400
        Task: Action Recognition
        Metrics:
          Top 1 Accuracy: 85.3
          Top 5 Accuracy: 96.7
    Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/videomae/vit-large-p16_videomae-k400-pre_16x4x1_kinetics-400_20221013-229dbb03.pth
```
configs/recognition/videomae/vit-base-p16_videomae-k400-pre_16x4x1_kinetics-400.py (new file, 61 lines)

```python
_base_ = ['../../_base_/default_runtime.py']

# model settings
model = dict(
    type='Recognizer3D',
    backbone=dict(
        type='VisionTransformer',
        img_size=224,
        patch_size=16,
        embed_dims=768,
        depth=12,
        num_heads=12,
        mlp_ratio=4,
        qkv_bias=True,
        num_frames=16,
        norm_cfg=dict(type='LN', eps=1e-6)),
    cls_head=dict(
        type='TimeSformerHead',
        num_classes=400,
        in_channels=768,
        average_clips='prob'),
    data_preprocessor=dict(
        type='ActionDataPreprocessor',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        format_shape='NCTHW'))

# dataset settings
dataset_type = 'VideoDataset'
data_root_val = 'data/kinetics400/videos_val'
ann_file_test = 'data/kinetics400/kinetics400_val_list_videos.txt'

test_pipeline = [
    dict(type='DecordInit'),
    dict(
        type='SampleFrames',
        clip_len=16,       # 16 frames per clip,
        frame_interval=4,  # sampled every 4 frames,
        num_clips=5,       # 5 temporal clips at test time (the "16x4x1" strategy)
        test_mode=True),
    dict(type='DecordDecode'),
    dict(type='Resize', scale=(-1, 224)),
    dict(type='ThreeCrop', crop_size=224),  # 3 spatial crops -> 5 clips x 3 crops = 15 views
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='PackActionInputs')
]

test_dataloader = dict(
    batch_size=1,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type=dataset_type,
        ann_file=ann_file_test,
        data_prefix=dict(video=data_root_val),
        pipeline=test_pipeline,
        test_mode=True))

test_evaluator = dict(type='AccMetric')
test_cfg = dict(type='TestLoop')
```
configs/recognition/videomae/vit-large-p16_videomae-k400-pre_16x4x1_kinetics-400.py (new file, 6 lines)

```python
_base_ = ['vit-base-p16_videomae-k400-pre_16x4x1_kinetics-400.py']

# model settings: only the ViT-L sizes differ from the ViT-B base config
model = dict(
    backbone=dict(embed_dims=1024, depth=24, num_heads=16),
    cls_head=dict(in_channels=1024))
```
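Because the ViT-L config only overrides the model sizes via `_base_` inheritance, one way to sanity-check the merged result is to load it with MMEngine's config loader. A small sketch, assuming mmengine is installed and the config paths above exist locally:

```python
from mmengine.config import Config

cfg = Config.fromfile(
    'configs/recognition/videomae/vit-large-p16_videomae-k400-pre_16x4x1_kinetics-400.py')

# Fields set in the ViT-L file override the ViT-B base; everything else is inherited.
print(cfg.model.backbone.embed_dims)  # 1024 (base config: 768)
print(cfg.model.backbone.depth)       # 24   (base config: 12)
print(cfg.test_pipeline[1])           # SampleFrames settings, inherited unchanged
```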
Review comment: These descriptions are no longer needed?