Improve docs about ViT.
mzr1996 committed Aug 24, 2021
1 parent b6766e1 commit be4bd58
Showing 5 changed files with 120 additions and 11 deletions.
9 changes: 7 additions & 2 deletions configs/vision_transformer/README.md
@@ -15,9 +15,14 @@
}
```

The training of Vision Transformers is divided into two steps. The first
step is to pre-train the model on a large dataset, like ImageNet-21k, to get
the pre-trained model, and the second step is to fine-tune the model on the
target dataset, like ImageNet-1k, to get the fine-tuned model. Here, we provide
both the pre-trained models and the fine-tuned models.
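
For readers who want to reproduce the second step, below is a minimal sketch of what a fine-tuning config looks like in MMClassification style. The base files and the checkpoint URL are illustrative assumptions, not an official config; the key point is initializing the backbone from a pre-trained checkpoint via `init_cfg`:

```python
# Hedged sketch of a fine-tuning config (base files and URL are hypothetical).
_base_ = [
    '../_base_/models/vit-base-p16.py',
    '../_base_/datasets/imagenet_bs32_pil_resize.py',
    '../_base_/schedules/imagenet_bs256_epochstep.py',
    '../_base_/default_runtime.py'
]

# Initialize the backbone from an ImageNet-21k pre-trained checkpoint,
# then fine-tune all weights on ImageNet-1k at 384x384 input resolution.
model = dict(
    backbone=dict(
        img_size=384,
        init_cfg=dict(
            type='Pretrained',
            checkpoint='https://example.com/vit-base-p16_in21k.pth',  # hypothetical
            prefix='backbone')))
```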

## Pretrain model

The pre-trained models are converted from [model zoo of Google Research](https://github.com/google-research/vision_transformer#available-vit-models).

### ImageNet 21k
@@ -33,7 +38,7 @@ The pre-trained models are converted from [model zoo of Google Research](https:/

## Finetune model

The finetune models are converted from [model zoo of Google Research](https://github.com/google-research/vision_transformer#available-vit-models).
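
The original Google checkpoints are plain NumPy `.npz` archives, so they can be inspected before or after conversion. A minimal sketch, assuming one of the linked `.npz` files has been downloaded locally (the file name here is hypothetical):

```python
import numpy as np

# Open the original JAX checkpoint; .npz archives map parameter names to arrays.
params = np.load('vit_b16_384.npz')  # hypothetical local copy

# Print a few parameter names and shapes to see the checkpoint layout.
for name in sorted(params.files)[:5]:
    print(name, params[name].shape)
```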

### ImageNet 1k
| Model | Pretrain | resolution | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
70 changes: 70 additions & 0 deletions configs/vision_transformer/metafile.yml
@@ -0,0 +1,70 @@
Collections:
- Name: Vision Transformer
Metadata:
Architecture:
- Attention Dropout
- Convolution
- Dense Connections
- Dropout
- GELU
- Layer Normalization
- Multi-Head Attention
- Scaled Dot-Product Attention
- Tanh Activation
Paper:
URL: https://arxiv.org/pdf/2010.11929.pdf
Title: 'An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale'
README: configs/vision_transformer/README.md

Models:
- Name: vit-base-p16_in21k-pre-3rdparty_in1k-384
In Collection: Vision Transformer
Config: configs/vision_transformer/vit-base-p16_ft-evalonly_in-1k-384.py
Metadata:
FLOPs: 33030000000
Parameters: 86860000
Training Data: ImageNet
Results:
- Dataset: ImageNet
Task: Image Classification
Metrics:
Top 1 Accuracy: 85.43
Top 5 Accuracy: 97.77
Weights: https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-base-p16_in21k-pre-3rdparty_in1k-384_20210819-65c4bf44.pth
Converted From:
Weights: https://console.cloud.google.com/storage/browser/_details/vit_models/augreg/B_16-i21k-300ep-lr_0.001-aug_medium1-wd_0.1-do_0.0-sd_0.0--imagenet2012-steps_20k-lr_0.03-res_384.npz
Code: https://github.com/google-research/vision_transformer/blob/88a52f8892c80c10de99194990a517b4d80485fd/vit_jax/models.py#L208
- Name: vit-base-p32_in21k-pre-3rdparty_in1k-384
In Collection: Vision Transformer
Config: configs/vision_transformer/vit-base-p32_ft-evalonly_in-1k-384.py
Metadata:
FLOPs: 8560000000
Parameters: 88300000
Training Data: ImageNet
Results:
- Dataset: ImageNet
Task: Image Classification
Metrics:
Top 1 Accuracy: 84.01
Top 5 Accuracy: 97.08
Weights: https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-base-p32_in21k-pre-3rdparty_in1k-384_20210819-a56f8886.pth
Converted From:
Weights: https://console.cloud.google.com/storage/browser/_details/vit_models/augreg/B_32-i21k-300ep-lr_0.001-aug_light1-wd_0.1-do_0.0-sd_0.0--imagenet2012-steps_20k-lr_0.01-res_384.npz
Code: https://github.com/google-research/vision_transformer/blob/88a52f8892c80c10de99194990a517b4d80485fd/vit_jax/models.py#L208
- Name: vit-large-p16_in21k-pre-3rdparty_in1k-384
In Collection: Vision Transformer
Config: configs/vision_transformer/vit-large-p16_ft-evalonly_in-1k-384.py
Metadata:
FLOPs: 116680000000
Parameters: 304720000
Training Data: ImageNet
Results:
- Dataset: ImageNet
Task: Image Classification
Metrics:
Top 1 Accuracy: 85.63
Top 5 Accuracy: 97.63
Weights: https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-large-p16_in21k-pre-3rdparty_in1k-384_20210819-0bb8550c.pth
Converted From:
Weights: https://console.cloud.google.com/storage/browser/_details/vit_models/augreg/L_16-i21k-300ep-lr_0.001-aug_strong1-wd_0.1-do_0.0-sd_0.0--imagenet2012-steps_20k-lr_0.01-res_384.npz
Code: https://github.com/google-research/vision_transformer/blob/88a52f8892c80c10de99194990a517b4d80485fd/vit_jax/models.py#L208
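
As a quick sanity check, the new metafile parses like any other YAML file. A minimal sketch using PyYAML, run from the repository root:

```python
import yaml

# Load the metafile and print each model's name and ImageNet top-1 accuracy.
with open('configs/vision_transformer/metafile.yml') as f:
    metafile = yaml.safe_load(f)

for model in metafile['Models']:
    top1 = model['Results'][0]['Metrics']['Top 1 Accuracy']
    print(f"{model['Name']}: top-1 {top1}%")
```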
34 changes: 34 additions & 0 deletions configs/vision_transformer/vit-large-p32_ft-evalonly_in-1k-384.py
@@ -0,0 +1,34 @@
# Refer to pytorch-image-models
_base_ = [
'../_base_/models/vit-large-p32.py',
'../_base_/datasets/imagenet_bs32_pil_resize.py',
'../_base_/schedules/imagenet_bs256_epochstep.py',
'../_base_/default_runtime.py'
]

model = dict(backbone=dict(img_size=384))

img_norm_cfg = dict(
mean=[127.5, 127.5, 127.5], std=[127.5, 127.5, 127.5], to_rgb=True)
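# A mean/std of 127.5 maps pixel values from [0, 255] to roughly [-1, 1],
# matching the preprocessing used by the original JAX ViT models.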

train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='RandomResizedCrop', size=384, backend='pillow'),
dict(type='RandomFlip', flip_prob=0.5, direction='horizontal'),
dict(type='Normalize', **img_norm_cfg),
dict(type='ImageToTensor', keys=['img']),
dict(type='ToTensor', keys=['gt_label']),
dict(type='Collect', keys=['img', 'gt_label'])
]

test_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='Resize', size=(384, -1), backend='pillow'),
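    # The size=(384, -1) above resizes the short edge to 384 while keeping
    # the aspect ratio; the CenterCrop below then takes a square 384 crop.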
dict(type='CenterCrop', crop_size=384),
dict(type='Normalize', **img_norm_cfg),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
]

data = dict(
train=dict(pipeline=train_pipeline), test=dict(pipeline=test_pipeline))
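
For reference, an evaluation-only config like this one can be exercised directly from the Python API. A minimal sketch, assuming a compatible checkpoint has been downloaded locally (the checkpoint file name is illustrative):

```python
from mmcls.apis import inference_model, init_model

# Build the model from the config and load the converted checkpoint.
model = init_model(
    'configs/vision_transformer/vit-large-p32_ft-evalonly_in-1k-384.py',
    'vit-large-p32_in1k-384.pth',  # hypothetical local checkpoint path
    device='cuda:0')

# Single-image inference; the result contains the predicted class and score.
result = inference_model(model, 'demo/demo.JPEG')
print(result['pred_class'], result['pred_score'])
```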
17 changes: 8 additions & 9 deletions docs/model_zoo.md
@@ -20,10 +20,10 @@ The ResNet family models below are trained by standard data augmentations, i.e.,
| ResNet-50 | 25.56 | 4.12 | 76.55 | 93.15 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/resnet/resnet50_b32x8_imagenet.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_batch256_imagenet_20200708-cfb998bf.pth) | [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_batch256_imagenet_20200708-cfb998bf.log.json) |
| ResNet-101 | 44.55 | 7.85 | 78.18 | 94.03 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/resnet/resnet101_b32x8_imagenet.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet101_batch256_imagenet_20200708-753f3608.pth) | [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet101_batch256_imagenet_20200708-753f3608.log.json) |
| ResNet-152 | 60.19 | 11.58 | 78.63 | 94.16 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/resnet/resnet152_b32x8_imagenet.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet152_batch256_imagenet_20200708-ec25b1f9.pth) | [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet152_batch256_imagenet_20200708-ec25b1f9.log.json) |
| ResNeSt-50\* | 27.48 | 5.41 | 81.13 | 95.59 | | [model](https://download.openmmlab.com/mmclassification/v0/resnest/resnest50_imagenet_converted-1ebf0afe.pth) | [log]() |
| ResNeSt-101\* | 48.28 | 10.27 | 82.32 | 96.24 | | [model](https://download.openmmlab.com/mmclassification/v0/resnest/resnest101_imagenet_converted-032caa52.pth) | [log]() |
| ResNeSt-200\* | 70.2 | 17.53 | 82.41 | 96.22 | | [model](https://download.openmmlab.com/mmclassification/v0/resnest/resnest200_imagenet_converted-581a60f2.pth) | [log]() |
| ResNeSt-269\* | 110.93 | 22.58 | 82.70 | 96.28 | | [model](https://download.openmmlab.com/mmclassification/v0/resnest/resnest269_imagenet_converted-59930960.pth) | [log]() |
| ResNetV1D-50 | 25.58 | 4.36 | 77.54 | 93.57 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/resnet/resnetv1d50_b32x8_imagenet.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d50_b32x8_imagenet_20210531-db14775a.pth) | [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d50_b32x8_imagenet_20210531-db14775a.log.json) |
| ResNetV1D-101 | 44.57 | 8.09 | 78.93 | 94.48 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/resnet/resnetv1d101_b32x8_imagenet.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d101_b32x8_imagenet_20210531-6e13bcd3.pth) | [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d101_b32x8_imagenet_20210531-6e13bcd3.log.json) |
| ResNetV1D-152 | 60.21 | 11.82 | 79.41 | 94.7 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/resnet/resnetv1d152_b32x8_imagenet.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d152_b32x8_imagenet_20210531-278cf22a.pth) | [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d152_b32x8_imagenet_20210531-278cf22a.log.json) |
@@ -36,14 +36,13 @@ The ResNet family models below are trained by standard data augmentations, i.e.,
| ShuffleNetV1 1.0x (group=3) | 1.87 | 0.146 | 68.13 | 87.81 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/shufflenet_v1/shufflenet_v1_1x_b64x16_linearlr_bn_nowd_imagenet.py) | [model](https://download.openmmlab.com/mmclassification/v0/shufflenet_v1/shufflenet_v1_batch1024_imagenet_20200804-5d6cec73.pth) | [log](https://download.openmmlab.com/mmclassification/v0/shufflenet_v1/shufflenet_v1_batch1024_imagenet_20200804-5d6cec73.log.json) |
| ShuffleNetV2 1.0x | 2.28 | 0.149 | 69.55 | 88.92 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/shufflenet_v2/shufflenet_v2_1x_b64x16_linearlr_bn_nowd_imagenet.py) | [model](https://download.openmmlab.com/mmclassification/v0/shufflenet_v2/shufflenet_v2_batch1024_imagenet_20200812-5bf4721e.pth) | [log](https://download.openmmlab.com/mmclassification/v0/shufflenet_v2/shufflenet_v2_batch1024_imagenet_20200804-8860eec9.log.json) |
| MobileNet V2 | 3.5 | 0.319 | 71.86 | 90.42 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/mobilenet_v2/mobilenet_v2_b32x8_imagenet.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilenet_v2/mobilenet_v2_batch256_imagenet_20200708-3b2dc3af.pth) | [log](https://download.openmmlab.com/mmclassification/v0/mobilenet_v2/mobilenet_v2_batch256_imagenet_20200708-3b2dc3af.log.json) |
| ViT-B/16\* | 86.86 | 33.03 | 85.43 | 97.77 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/vision_transformer/vit-base-p16_ft-evalonly_in-1k-384.py) | [model](https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-base-p16_in21k-pre-3rdparty_in1k-384_20210819-65c4bf44.pth) | [log]() |
| ViT-B/32\* | 88.3 | 8.56 | 84.01 | 97.08 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/vision_transformer/vit-base-p32_ft-evalonly_in-1k-384.py) | [model](https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-base-p32_in21k-pre-3rdparty_in1k-384_20210819-a56f8886.pth) | [log]() |
| ViT-L/16\* | 304.72 | 116.68 | 85.63 | 97.63 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/vision_transformer/vit-large-p16_ft-evalonly_in-1k-384.py) | [model](https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-large-p16_in21k-pre-3rdparty_in1k-384_20210819-0bb8550c.pth) | [log]() |
| Swin-Transformer tiny | 28.29 | 4.36 | 81.18 | 95.61 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/swin_transformer/swin_tiny_224_b16x64_300e_imagenet.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_tiny_224_b16x64_300e_imagenet_20210616_090925-66df6be6.pth) | [log](https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_tiny_224_b16x64_300e_imagenet_20210616_090925.log.json)|
| Swin-Transformer small| 49.61 | 8.52 | 83.02 | 96.29 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/swin_transformer/swin_small_224_b16x64_300e_imagenet.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_small_224_b16x64_300e_imagenet_20210615_110219-7f9d988b.pth) | [log](https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_small_224_b16x64_300e_imagenet_20210615_110219.log.json)|
| Swin-Transformer base | 87.77 | 15.14 | 83.36 | 96.44 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/swin_transformer/swin_base_224_b16x64_300e_imagenet.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_base_224_b16x64_300e_imagenet_20210616_190742-93230b0d.pth) | [log](https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_base_224_b16x64_300e_imagenet_20210616_190742.log.json)|
| Transformer in Transformer small\* | 23.76 | 3.36 | 81.52 | 95.73 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/tnt/tnt_s_patch16_224_evalonly_imagenet) | [model](http://download.openmmlab.com/mmclassification/v0/transformer-in-transformer/convert/tnt_s_patch16_224_evalonly_imagenet.pth) | [log]()|

Models with \* are converted from other repos, while the others are trained by ourselves.

1 change: 1 addition & 0 deletions model-index.yml
@@ -8,3 +8,4 @@ Import:
- configs/shufflenet_v2/metafile.yml
- configs/swin_transformer/metafile.yml
- configs/vgg/metafile.yml
- configs/vision_transformer/metafile.yml
