[Refactor] Refactor ViT (Continue #295) #395

Merged
merged 23 commits on Oct 18, 2021
Changes from 20 commits

Commits (23)
5a87816
[Squash] Refactor ViT (from #295)
mzr1996 Aug 5, 2021
b98a477
Use base variable to simplify auto_aug setting
mzr1996 Aug 10, 2021
7697d0f
Use common PatchEmbed, remove HybridEmbed and refactor ViT init
mzr1996 Aug 10, 2021
206eb84
Add `output_cls_token` option and change the output format of ViT and
mzr1996 Aug 10, 2021
0f83ea3
Update unit tests and add test for `output_cls_token`.
mzr1996 Aug 10, 2021
fb265e8
Support out_indices.
mzr1996 Aug 11, 2021
a504171
Standardize config files
mzr1996 Aug 18, 2021
a6ea15f
Support resize position embedding.
mzr1996 Aug 19, 2021
46050db
Add readme file of vit
mzr1996 Aug 20, 2021
23ae298
Rename config file
mzr1996 Aug 20, 2021
537836f
Improve docs about ViT.
mzr1996 Aug 23, 2021
170c100
Update docstring
mzr1996 Sep 9, 2021
37bd1a7
Use local version `MultiheadAttention` instead of mmcv version.
mzr1996 Sep 15, 2021
f37941f
Fix MultiheadAttention
mzr1996 Sep 15, 2021
d06a7a7
Support `qk_scale` argument in `MultiheadAttention`
mzr1996 Sep 26, 2021
9331762
Improve docs and change `layer_cfg` to `layer_cfgs` and support
mzr1996 Sep 28, 2021
bed85e3
Use init_cfg to init Linear layer in VisionTransformerHead
mzr1996 Sep 28, 2021
3a037bd
update metafile
mzr1996 Sep 28, 2021
213b172
Update checkpoints and configs
mzr1996 Sep 28, 2021
6c7f003
Improve docstring.
mzr1996 Sep 28, 2021
f7d1956
Update README
mzr1996 Oct 15, 2021
2c309f6
Revert GAP modification.
mzr1996 Oct 15, 2021
d1a2f41
Merge remote-tracking branch 'origin/master' into refactor-vit
mzr1996 Oct 15, 2021
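
Taken together, the commits above change the backbone interface: architectures are selected with `arch`, intermediate stages can be returned via `out_indices`, and the class token can be returned separately through `output_cls_token`. A minimal sketch of the resulting usage, assuming the refactored `VisionTransformer` from this PR (shapes and defaults are illustrative, not verified against every commit):

```python
import torch

from mmcls.models import VisionTransformer

# ViT-Base with the options introduced by this refactor.
backbone = VisionTransformer(
    arch='b',                # preset: 768-dim embeddings, 12 layers, 12 heads
    img_size=224,
    patch_size=16,
    out_indices=-1,          # only the last layer's output
    output_cls_token=True)   # each stage returns (patch_token, cls_token)
backbone.init_weights()

x = torch.randn(1, 3, 224, 224)
outs = backbone(x)
patch_token, cls_token = outs[-1]
# Roughly expected: patch_token (1, 768, 14, 14), cls_token (1, 768)
print(patch_token.shape, cls_token.shape)
```
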
45 changes: 45 additions & 0 deletions configs/_base_/datasets/imagenet_bs64_pil_resize_autoaug.py
@@ -0,0 +1,45 @@
_base_ = [
    'pipelines/auto_aug.py',
]

# dataset settings
dataset_type = 'ImageNet'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='RandomResizedCrop', size=224, backend='pillow'),
    dict(type='RandomFlip', flip_prob=0.5, direction='horizontal'),
    dict(type='AutoAugment', policies={{_base_.policy_imagenet}}),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='ImageToTensor', keys=['img']),
    dict(type='ToTensor', keys=['gt_label']),
    dict(type='Collect', keys=['img', 'gt_label'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='Resize', size=(256, -1), backend='pillow'),
    dict(type='CenterCrop', crop_size=224),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='ImageToTensor', keys=['img']),
    dict(type='Collect', keys=['img'])
]
data = dict(
    samples_per_gpu=64,
    workers_per_gpu=2,
    train=dict(
        type=dataset_type,
        data_prefix='data/imagenet/train',
        pipeline=train_pipeline),
    val=dict(
        type=dataset_type,
        data_prefix='data/imagenet/val',
        ann_file='data/imagenet/meta/val.txt',
        pipeline=test_pipeline),
    test=dict(
        # replace `data/val` with `data/test` for standard test
        type=dataset_type,
        data_prefix='data/imagenet/val',
        ann_file='data/imagenet/meta/val.txt',
        pipeline=test_pipeline))
evaluation = dict(interval=1, metric='accuracy')
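
Downstream configs pick up this dataset setting through `_base_`. A minimal sketch of such a top-level config (the file name is an assumption; the referenced base files all appear elsewhere in this PR):

```python
# Assumed top-level config, e.g.
# configs/vision_transformer/vit-base-p16_pt-64xb64_in1k-224.py
_base_ = [
    '../_base_/models/vit-base-p16.py',
    '../_base_/datasets/imagenet_bs64_pil_resize_autoaug.py',
    '../_base_/schedules/imagenet_bs4096_AdamW.py',
    '../_base_/default_runtime.py'
]
```
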
@@ -1,18 +1,13 @@
_base_ = [
    '../_base_/models/vit_base_patch16_224_pretrain.py',
    '../_base_/datasets/imagenet_bs64_pil_resize.py',
    '../_base_/schedules/imagenet_bs4096_AdamW.py',
    '../_base_/default_runtime.py'
]

policies = [
# Policy for ImageNet, refers to
# https://github.com/DeepVoltaire/AutoAugment/blame/master/autoaugment.py
policy_imagenet = [
    [
        dict(type='Posterize', bits=4, prob=0.4),
        dict(type='Rotate', angle=30., prob=0.6)
    ],
    [
        dict(type='Solarize', thr=256 / 9 * 4, prob=0.6),
        dict(type='AutoContrast', prob=0.5)
        dict(type='AutoContrast', prob=0.6)
    ],
    [dict(type='Equalize', prob=0.8),
     dict(type='Equalize', prob=0.6)],
@@ -40,7 +35,7 @@
    ],
    [
        dict(type='Equalize', prob=0.6),
        dict(type='Posterize', bits=5, prob=0.6)
        dict(type='Posterize', bits=5, prob=0.4)
    ],
    [
        dict(type='Rotate', angle=30 / 9 * 8, prob=0.8),
@@ -99,45 +94,3 @@
    [dict(type='Equalize', prob=0.8),
     dict(type='Equalize', prob=0.6)],
]

dataset_type = 'ImageNet'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='RandomResizedCrop', size=224),
    dict(type='RandomFlip', flip_prob=0.5, direction='horizontal'),
    dict(type='AutoAugment', policies=policies),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='ImageToTensor', keys=['img']),
    dict(type='ToTensor', keys=['gt_label']),
    dict(type='Collect', keys=['img', 'gt_label'])
]

test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='Resize', size=(256, -1), backend='pillow'),
    dict(type='CenterCrop', crop_size=224),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='ImageToTensor', keys=['img']),
    dict(type='Collect', keys=['img'])
]
data = dict(
    samples_per_gpu=64,
    workers_per_gpu=2,
    train=dict(
        type=dataset_type,
        data_prefix='data/imagenet/train',
        pipeline=train_pipeline),
    val=dict(
        type=dataset_type,
        data_prefix='data/imagenet/val',
        ann_file='data/imagenet/meta/val.txt',
        pipeline=test_pipeline),
    test=dict(
        # replace `data/val` with `data/test` for standard test
        type=dataset_type,
        data_prefix='data/imagenet/val',
        ann_file='data/imagenet/meta/val.txt',
        pipeline=test_pipeline))
evaluation = dict(interval=1, metric='accuracy')
25 changes: 25 additions & 0 deletions configs/_base_/models/vit-base-p16.py
@@ -0,0 +1,25 @@
# model settings
model = dict(
    type='ImageClassifier',
    backbone=dict(
        type='VisionTransformer',
        arch='b',
        img_size=224,
        patch_size=16,
        drop_rate=0.1,
        init_cfg=[
            dict(
                type='Kaiming',
                layer='Conv2d',
                mode='fan_in',
                nonlinearity='linear')
        ]),
    neck=None,
    head=dict(
        type='VisionTransformerClsHead',
        num_classes=1000,
        in_channels=768,
        loss=dict(
            type='LabelSmoothLoss', label_smooth_val=0.1,
            mode='classy_vision'),
    ))
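
For reference, the config above can be instantiated directly through the mmcls builders; a small sketch, assuming a standard mmcv/mmcls installation:

```python
from mmcv import Config
from mmcls.models import build_classifier

cfg = Config.fromfile('configs/_base_/models/vit-base-p16.py')
model = build_classifier(cfg.model)  # ImageClassifier with the ViT-B/16 backbone
model.init_weights()                 # applies the Kaiming init_cfg shown above
```
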
@@ -3,14 +3,17 @@
    type='ImageClassifier',
    backbone=dict(
        type='VisionTransformer',
        num_layers=12,
        embed_dim=768,
        num_heads=12,
        img_size=384,
        arch='b',
        img_size=224,
        patch_size=32,
        in_channels=3,
        feedforward_channels=3072,
        drop_rate=0.1),
        drop_rate=0.1,
        init_cfg=[
            dict(
                type='Kaiming',
                layer='Conv2d',
                mode='fan_in',
                nonlinearity='linear')
        ]),
    neck=None,
    head=dict(
        type='VisionTransformerClsHead',
@@ -3,14 +3,17 @@
    type='ImageClassifier',
    backbone=dict(
        type='VisionTransformer',
        num_layers=24,
        embed_dim=1024,
        num_heads=16,
        arch='l',
        img_size=224,
        patch_size=16,
        in_channels=3,
        feedforward_channels=4096,
        drop_rate=0.1),
        drop_rate=0.1,
        init_cfg=[
            dict(
                type='Kaiming',
                layer='Conv2d',
                mode='fan_in',
                nonlinearity='linear')
        ]),
    neck=None,
    head=dict(
        type='VisionTransformerClsHead',
@@ -3,14 +3,17 @@
    type='ImageClassifier',
    backbone=dict(
        type='VisionTransformer',
        num_layers=24,
        embed_dim=1024,
        num_heads=16,
        img_size=384,
        arch='l',
        img_size=224,
        patch_size=32,
        in_channels=3,
        feedforward_channels=4096,
        drop_rate=0.1),
        drop_rate=0.1,
        init_cfg=[
            dict(
                type='Kaiming',
                layer='Conv2d',
                mode='fan_in',
                nonlinearity='linear')
        ]),
    neck=None,
    head=dict(
        type='VisionTransformerClsHead',
21 changes: 0 additions & 21 deletions configs/_base_/models/vit_base_patch16_224_finetune.py

This file was deleted.

26 changes: 0 additions & 26 deletions configs/_base_/models/vit_base_patch16_224_pretrain.py

This file was deleted.

21 changes: 0 additions & 21 deletions configs/_base_/models/vit_base_patch16_384_finetune.py

This file was deleted.

21 changes: 0 additions & 21 deletions configs/_base_/models/vit_large_patch16_384_finetune.py

This file was deleted.

33 changes: 33 additions & 0 deletions configs/vision_transformer/README.md
@@ -14,3 +14,36 @@
url={https://openreview.net/forum?id=YicbFdNTTy}
}
```

Vision Transformers are trained in two steps. The first step is pre-training the model on a large dataset, such as ImageNet-21k, to obtain the pre-trained model. The second step is fine-tuning the model on the target dataset, such as ImageNet-1k, to obtain the fine-tuned model. Here, we provide both the pre-trained models and the fine-tuned models.
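
As a rough sketch of the second step, a fine-tuning config can start from one of the pre-trained checkpoints listed below via `load_from`; the composition here is illustrative, not the repository's canonical recipe:

```python
# Illustrative fine-tuning config; reuses the ViT-B/16 384x384 config from this PR.
_base_ = ['./vit-base-p16_ft-64xb64_in1k-384.py']

# Load the converted ImageNet-21k checkpoint before fine-tuning starts.
load_from = ('https://download.openmmlab.com/mmclassification/v0/vit/pretrain/'
             'vit-base-p16_3rdparty_pt-64xb64_in1k-224_20210928-02284250.pth')
```
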

## Pretrained models

The pre-trained models are converted from [model zoo of Google Research](https://github.com/google-research/vision_transformer#available-vit-models).

### ImageNet 21k

| Model | Params(M) | Flops(G) | Download |
|:----------:|:---------:|:---------:|:--------:|
| ViT-B16\* | 86.86 | 33.03 | [model](https://download.openmmlab.com/mmclassification/v0/vit/pretrain/vit-base-p16_3rdparty_pt-64xb64_in1k-224_20210928-02284250.pth)|
| ViT-B32\* | 88.30 | 8.56 | [model](https://download.openmmlab.com/mmclassification/v0/vit/pretrain/vit-base-p32_3rdparty_pt-64xb64_in1k-224_20210928-eee25dd4.pth)|
| ViT-L16\* | 304.72 | 116.68 | [model](https://download.openmmlab.com/mmclassification/v0/vit/pretrain/vit-large-p16_3rdparty_pt-64xb64_in1k-224_20210928-0001f9a1.pth)|

*Models with \* are converted from other repos.*


## Fine-tuned models

The fine-tuned models are converted from [model zoo of Google Research](https://github.com/google-research/vision_transformer#available-vit-models).

### ImageNet 1k
| Model | Pretrain | resolution | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
|:----------:|:------------:|:-----------:|:---------:|:---------:|:---------:|:---------:|:----------:|:--------:|
| ViT-B16\* | ImageNet-21k | 384x384 | 86.86 | 33.03 | 85.43 | 97.77 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/vision_transformer/vit-base-p16_ft-64xb64_in1k-384.py) | [model](https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-base-p16_in21k-pre-3rdparty_ft-64xb64_in1k-384_20210928-98e8652b.pth)|
| ViT-B32\* | ImageNet-21k | 384x384 | 88.30 | 8.56 | 84.01 | 97.08 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/vision_transformer/vit-base-p32_ft-64xb64_in1k-384.py) | [model](https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-base-p32_in21k-pre-3rdparty_ft-64xb64_in1k-384_20210928-9cea8599.pth)|
| ViT-L16\* | ImageNet-21k | 384x384 | 304.72 | 116.68 | 85.63 | 97.63 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/vision_transformer/vit-large-p16_ft-64xb64_in1k-384.py) | [model](https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-large-p16_in21k-pre-3rdparty_ft-64xb64_in1k-384_20210928-b20ba619.pth)|
*Models with \* are converted from other repos.*
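
To sanity-check one of the converted checkpoints, the high-level mmcls APIs can be used; a brief sketch (the demo image path is an assumption):

```python
from mmcls.apis import inference_model, init_model

config = 'configs/vision_transformer/vit-base-p16_ft-64xb64_in1k-384.py'
checkpoint = ('https://download.openmmlab.com/mmclassification/v0/vit/finetune/'
              'vit-base-p16_in21k-pre-3rdparty_ft-64xb64_in1k-384_20210928-98e8652b.pth')

model = init_model(config, checkpoint, device='cpu')
result = inference_model(model, 'demo/demo.JPEG')  # any RGB image path works
print(result['pred_class'], result['pred_score'])
```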