[Docs] Update Mask2Former README.md. (#708)
* add doc

* update doc

* update init_cfg

* update doc and fix vis

* update model_zoo

* add link

* fix conflict

* Update README.md

Co-authored-by: Tao Gong <gt950513@mail.ustc.edu.cn>
Pengxiang Li and GT9505 committed Sep 13, 2022
1 parent d4019cf commit c889761
Showing 7 changed files with 204 additions and 2 deletions.
3 changes: 2 additions & 1 deletion README.md
@@ -69,7 +69,7 @@ The master branch works with **PyTorch1.6+**.

## What's New

Release [StrongSORT](configs/mot/strongsort) pretrained models.
Release [Mask2Former](configs/vis/mask2former) pretrained models.

v1.0.0rc0 was released on 31/08/2022.
Please refer to [changelog.md](docs/en/changelog.md) for details and release history.
@@ -126,6 +126,7 @@ Supported Datasets
Supported Methods

- [x] [MaskTrack R-CNN](configs/vis/masktrack_rcnn) (ICCV 2019)
- [x] [Mask2Former](configs/vis/mask2former) (CVPR 2022)

Supported Datasets

78 changes: 78 additions & 0 deletions configs/vis/mask2former/README.md
@@ -0,0 +1,78 @@
# Mask2Former for Video Instance Segmentation

## Abstract

<!-- [ABSTRACT] -->

We find Mask2Former also achieves state-of-the-art performance on video instance segmentation without modifying the architecture, the loss or even the training pipeline. In this report, we show universal image segmentation architectures trivially generalize to video segmentation by directly predicting 3D segmentation volumes. Specifically, Mask2Former sets a new state-of-the-art of 60.4 AP on YouTubeVIS-2019 and 52.6 AP on YouTubeVIS-2021. We believe Mask2Former is also capable of handling video semantic and panoptic segmentation, given its versatility in image segmentation. We hope this will make state-of-the-art video segmentation research more accessible and bring more attention to designing universal image and video segmentation architectures.

<!-- [IMAGE] -->

<div align="center">
<img src="https://user-images.githubusercontent.com/46072190/188271377-164634a5-4d65-4161-8a69-2d0eaf2791f8.png"/>
</div>

## Citation

<!-- [ALGORITHM] -->

```latex
@inproceedings{cheng2021mask2former,
  title={Masked-attention Mask Transformer for Universal Image Segmentation},
  author={Bowen Cheng and Ishan Misra and Alexander G. Schwing and Alexander Kirillov and Rohit Girdhar},
  booktitle={CVPR},
  year={2022}
}
```

## Results and models of Mask2Former on YouTube-VIS 2021 validation dataset

Note: CodaLab has closed the evaluation server for `YouTube-VIS 2019`, so we do not provide `YouTube-VIS 2019` results at present. To evaluate on `YouTube-VIS 2021`, you can currently submit the results to the `YouTube-VIS 2022` evaluation server; the reported `AP_S` value corresponds to the `YouTube-VIS 2021` result.

| Method | Backbone | Style | Lr schd | Mem (GB) | Inf time (fps) | AP | Config | Download |
| :----------------------: | :------: | :-----: | :-----: | :------: | :------------: | :--: | :---------------------------------------------------------------------: | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| Mask2Former | R-50 | pytorch | 8e | 6.0 | - | 41.2 | [config](mask2former_r50_8xb2-8e_youtubevis2021.py) | [model](https://download.openmmlab.com/mmtracking/vis/mask2former/mask2former_r50_8xb2-8e_youtubevis2021_20220818_164043-1cab1219.pth) \| [log](https://download.openmmlab.com/mmtracking/vis/mask2former/mask2former_r50_8xb2-8e_youtubevis2021_20220818_164043.json) |
| Mask2Former | R-101 | pytorch | 8e | 7.5 | - | 42.3 | [config](mask2former_r101_8xb2-8e_youtubevis2021.py) | [model](https://download.openmmlab.com/mmtracking/vis/mask2former/mask2former_r101_8xb2-8e_youtubevis2021_20220823_092747-b7a7d7cc.pth) \| [log](https://download.openmmlab.com/mmtracking/vis/mask2former/mask2former_r101_8xb2-8e_youtubevis2021_20220823_092747.json) |
| Mask2Former(200 queries) | Swin-L | pytorch | 8e | 18.5 | - | 52.3 | [config](mask2former_swin-l-p4-w12-384-in21k_8xb2-8e_youtubevis2021.py) | [model](https://download.openmmlab.com/mmtracking/vis/mask2former/mask2former_swin-l-p4-w12-384-in21k_8xb2-8e_youtubevis2021_20220907_124752-c04b720e.pth) \| [log](https://download.openmmlab.com/mmtracking/vis/mask2former/mask2former_swin-l-p4-w12-384-in21k_8xb2-8e_youtubevis2021_20220907_124752.json) |
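
All checkpoints in the table are direct download links; as a minimal sketch (assuming `wget` is available), the R-50 model can be fetched into a local `checkpoints/` directory like this:

```shell
# Download the R-50 checkpoint listed in the table above into ./checkpoints/.
mkdir -p checkpoints
wget -P checkpoints \
    https://download.openmmlab.com/mmtracking/vis/mask2former/mask2former_r50_8xb2-8e_youtubevis2021_20220818_164043-1cab1219.pth
```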

## Get started

### 1. Training

Parameters such as the learning rate in the default configuration file are tuned for 8 GPUs, so we recommend training with 8 GPUs to reproduce the reported accuracy. You can start training with the following command.

```shell
# Train Mask2Former on the YouTube-VIS 2019 dataset with the following command.
# The number after the config file is the number of GPUs to use. Here we use 8 GPUs.
./tools/dist_train.sh \
configs/vis/mask2former/mask2former_r50_8xb2-8e_youtubevis2019.py 8
```

For more detailed usage of `train.py/dist_train.sh/slurm_train.sh`, please refer to this [document](../../../docs/en/user_guides/4_train_test.md).
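
If 8 GPUs are not available, the standard OpenMMLab entry points can also be used. Below is a hedged sketch of single-GPU and Slurm training: the invocations assume the usual `tools/train.py` and `tools/slurm_train.sh` conventions, `my_partition`/`mask2former_vis` are placeholder names, and the default learning rate is tuned for 8 GPUs.

```shell
# Single-GPU training (a sketch; the default learning rate assumes 8 GPUs,
# so you may need to scale it down accordingly).
python tools/train.py \
    configs/vis/mask2former/mask2former_r50_8xb2-8e_youtubevis2019.py

# Training on a Slurm cluster (a sketch): partition, job name, config, work dir.
GPUS=8 ./tools/slurm_train.sh my_partition mask2former_vis \
    configs/vis/mask2former/mask2former_r50_8xb2-8e_youtubevis2019.py \
    ./work_dirs/mask2former_r50_youtubevis2019
```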

### 2. Testing and evaluation

To get results on the [YouTube-VIS](https://youtube-vos.org/dataset/vis/) val/test set, use the following command to generate a result file that can be used for submission. It will be stored as `./youtube_vis_results.submission_file.zip`; you can modify the save path in the `test_evaluator` of the config.

```shell
# The number after the config file is the number of GPUs to use.
./tools/dist_test.sh \
configs/vis/mask2former/mask2former_r50_8xb2-8e_youtubevis2019.py 8 \
--checkpoint ./checkpoints/xxx
```
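
If you prefer not to edit the config file, the save path mentioned above can usually be overridden from the command line; the following is a sketch that assumes `test_evaluator` exposes an `outfile_prefix`-style key (check the exact key name in your config before relying on it):

```shell
# Override the result-file prefix of test_evaluator on the fly.
# `outfile_prefix` is an assumed key name; verify it against the test_evaluator in the config.
./tools/dist_test.sh \
    configs/vis/mask2former/mask2former_r50_8xb2-8e_youtubevis2019.py 8 \
    --checkpoint ./checkpoints/xxx \
    --cfg-options test_evaluator.outfile_prefix=./work_dirs/youtube_vis_results
```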

For more detailed usage of `test.py/dist_test.sh/slurm_test.sh`, please refer to this [document](../../../docs/en/user_guides/4_train_test.md).

### 3. Inference

Use a single GPU to run inference on a video and save the result as a video.

```shell
python demo/demo_mot_vis.py \
configs/vis/mask2former/mask2former_r50_8xb2-8e_youtubevis2019.py \
--checkpoint ./checkpoints/xxx \
--input demo/demo.mp4 \
--output vis.mp4
```
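
The same demo works with any checkpoint from the table above; for example, here is a sketch using the Swin-L (200 queries) model, assuming its checkpoint has been downloaded to `./checkpoints/` (the file name matches the download URL in the table):

```shell
# Run the demo with the Swin-L config (a sketch; the checkpoint is assumed to be downloaded locally).
python demo/demo_mot_vis.py \
    configs/vis/mask2former/mask2former_swin-l-p4-w12-384-in21k_8xb2-8e_youtubevis2021.py \
    --checkpoint ./checkpoints/mask2former_swin-l-p4-w12-384-in21k_8xb2-8e_youtubevis2021_20220907_124752-c04b720e.pth \
    --input demo/demo.mp4 \
    --output vis_swin_l.mp4
```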

For more detailed usage of `demo_mot_vis.py`, please refer to this [document](../../../docs/en/user_guides/3_inference.md).
64 changes: 64 additions & 0 deletions configs/vis/mask2former/mask2former_swin-l-p4-w12-384-in21k_8xb2-8e_youtubevis2021.py
@@ -0,0 +1,64 @@
_base_ = ['./mask2former_r50_8xb2-8e_youtubevis2021.py']
depths = [2, 2, 18, 2]
model = dict(
    type='Mask2Former',
    backbone=dict(
        _delete_=True,
        type='mmdet.SwinTransformer',
        pretrain_img_size=384,
        embed_dims=192,
        depths=depths,
        num_heads=[6, 12, 24, 48],
        window_size=12,
        mlp_ratio=4,
        qkv_bias=True,
        qk_scale=None,
        drop_rate=0.,
        attn_drop_rate=0.,
        drop_path_rate=0.3,
        patch_norm=True,
        out_indices=(0, 1, 2, 3),
        with_cp=False,
        convert_weights=True,
        frozen_stages=-1,
        init_cfg=None),
    track_head=dict(
        type='Mask2FormerHead',
        in_channels=[192, 384, 768, 1536],
        num_queries=200),
    init_cfg=dict(
        type='Pretrained',
        checkpoint=  # noqa: E251
        'https://download.openmmlab.com/mmdetection/v2.0/mask2former/'
        'mask2former_swin-l-p4-w12-384-in21k_lsj_16x1_100e_coco-panoptic/'
        'mask2former_swin-l-p4-w12-384-in21k_lsj_16x1_100e_coco-panoptic_'
        '20220407_104949-d4919c44.pth'))

# set all layers in backbone to lr_mult=0.1
# set all norm layers, position_embedding,
# query_embedding, level_embedding to decay_mult=0.0
backbone_norm_multi = dict(lr_mult=0.1, decay_mult=0.0)
backbone_embed_multi = dict(lr_mult=0.1, decay_mult=0.0)
embed_multi = dict(lr_mult=1.0, decay_mult=0.0)
custom_keys = {
    'backbone': dict(lr_mult=0.1, decay_mult=1.0),
    'backbone.patch_embed.norm': backbone_norm_multi,
    'backbone.norm': backbone_norm_multi,
    'absolute_pos_embed': backbone_embed_multi,
    'relative_position_bias_table': backbone_embed_multi,
    'query_embed': embed_multi,
    'query_feat': embed_multi,
    'level_embed': embed_multi
}
custom_keys.update({
    f'backbone.stages.{stage_id}.blocks.{block_id}.norm': backbone_norm_multi
    for stage_id, num_blocks in enumerate(depths)
    for block_id in range(num_blocks)
})
custom_keys.update({
    f'backbone.stages.{stage_id}.downsample.norm': backbone_norm_multi
    for stage_id in range(len(depths) - 1)
})
# optimizer
optim_wrapper = dict(
    paramwise_cfg=dict(custom_keys=custom_keys, norm_decay_mult=0.0))
53 changes: 53 additions & 0 deletions configs/vis/mask2former/metafile.yml
@@ -0,0 +1,53 @@
Collections:
  - Name: Mask2Former
    Metadata:
      Training Techniques:
        - AdamW
        - Weight Decay
      Training Resources: 8x A100 GPUs
      Architecture:
        - Mask2Former
    Paper:
      URL: https://arxiv.org/pdf/2112.10764.pdf
      Title: Mask2Former for Video Instance Segmentation
    README: configs/vis/mask2former/README.md

Models:
  - Name: mask2former_r50_8xb2-8e_youtubevis2021
    In Collection: Mask2Former
    Config: configs/vis/mask2former/mask2former_r50_8xb2-8e_youtubevis2021.py
    Metadata:
      Training Data: YouTube-VIS 2021
      Training Memory (GB): 6.0
    Results:
      - Task: Video Instance Segmentation
        Dataset: YouTube-VIS 2021
        Metrics:
          AP: 41.2
    Weights: https://download.openmmlab.com/mmtracking/vis/mask2former/mask2former_r50_8xb2-8e_youtubevis2021_20220818_164043-1cab1219.pth

  - Name: mask2former_r101_8xb2-8e_youtubevis2021
    In Collection: Mask2Former
    Config: configs/vis/mask2former/mask2former_r101_8xb2-8e_youtubevis2021.py
    Metadata:
      Training Data: YouTube-VIS 2021
      Training Memory (GB): 7.5
    Results:
      - Task: Video Instance Segmentation
        Dataset: YouTube-VIS 2021
        Metrics:
          AP: 42.3
    Weights: https://download.openmmlab.com/mmtracking/vis/mask2former/mask2former_r101_8xb2-8e_youtubevis2021_20220823_092747-b7a7d7cc.pth

  - Name: mask2former_swin-l-p4-w12-384-in21k_8xb2-8e_youtubevis2021
    In Collection: Mask2Former
    Config: configs/vis/mask2former/mask2former_swin-l-p4-w12-384-in21k_8xb2-8e_youtubevis2021.py
    Metadata:
      Training Data: YouTube-VIS 2021
      Training Memory (GB): 18.5
    Results:
      - Task: Video Instance Segmentation
        Dataset: YouTube-VIS 2021
        Metrics:
          AP: 52.3
    Weights: https://download.openmmlab.com/mmtracking/vis/mask2former/mask2former_swin-l-p4-w12-384-in21k_8xb2-8e_youtubevis2021_20220907_124752-c04b720e.pth
4 changes: 4 additions & 0 deletions docs/en/model_zoo.md
@@ -84,3 +84,7 @@ Please refer to [STARK](https://github.com/open-mmlab/mmtracking/blob/master/con
### MaskTrack R-CNN (ICCV 2019)

Please refer to [MaskTrack R-CNN](https://github.com/open-mmlab/mmtracking/blob/master/configs/vis/masktrack_rcnn) for details.

### Mask2Former (CVPR 2022)

Please refer to [Mask2Former](https://github.com/open-mmlab/mmtracking/blob/master/configs/vis/mask2former) for details.
1 change: 1 addition & 0 deletions model-index.yml
@@ -11,3 +11,4 @@ Import:
- configs/vid/selsa/metafile.yml
- configs/vid/temporal_roi_align/metafile.yml
- configs/vis/masktrack_rcnn/metafile.yml
- configs/vis/mask2former/metafile.yml
3 changes: 2 additions & 1 deletion tools/analysis_tools/browse_dataset.py
@@ -98,7 +98,8 @@ def main():
visualizer.add_datasample(
osp.basename(img_path),
img,
gt_sample=gt_sample,
data_sample=gt_sample,
draw_pred=False,
show=not args.not_show,
wait_time=args.show_interval,
out_file=out_file)
