Improve docs about ViT.
mzr1996 committed Aug 24, 2021
1 parent b6766e1 commit be4bd58
Showing 5 changed files with 120 additions and 11 deletions.
9 changes: 7 additions & 2 deletions configs/vision_transformer/README.md
@@ -15,9 +15,14 @@
}
```

The training of Vision Transformers is divided into two steps. The first
step is to pre-train the model on a large dataset, like ImageNet-21k, to get
the pre-trained model, and the second step is to fine-tune the model on the
target dataset, like ImageNet-1k, to get the fine-tuned model. Here, we provide
both the pre-trained models and the fine-tuned models.
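
For readers who want to reproduce the second step, below is a minimal sketch of what a fine-tuning config looks like in MMClassification style. The base files and the checkpoint URL are illustrative assumptions, not an official config; the key point is initializing the backbone from a pre-trained checkpoint via `init_cfg`:

```python
# Hedged sketch of a fine-tuning config (base files and URL are hypothetical).
_base_ = [
    '../_base_/models/vit-base-p16.py',
    '../_base_/datasets/imagenet_bs32_pil_resize.py',
    '../_base_/schedules/imagenet_bs256_epochstep.py',
    '../_base_/default_runtime.py'
]

# Initialize the backbone from an ImageNet-21k pre-trained checkpoint,
# then fine-tune all weights on ImageNet-1k at 384x384 input resolution.
model = dict(
    backbone=dict(
        img_size=384,
        init_cfg=dict(
            type='Pretrained',
            checkpoint='https://example.com/vit-base-p16_in21k.pth',  # hypothetical
            prefix='backbone')))
```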

## Pretrain model

The pre-trained models are converted from [model zoo of Google Research](https://github.com/google-research/vision_transformer#available-vit-models).

### ImageNet 21k
@@ -33,7 +38,7 @@ The pre-trained models are converted from [model zoo of Google Research](https:/

## Finetune model

The finetune models are converted from [model zoo of Google Research](https://github.com/google-research/vision_transformer#available-vit-models).
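
The original Google checkpoints are plain NumPy `.npz` archives, so they can be inspected before or after conversion. A minimal sketch, assuming one of the linked `.npz` files has been downloaded locally (the file name here is hypothetical):

```python
import numpy as np

# Open the original JAX checkpoint; .npz archives map parameter names to arrays.
params = np.load('vit_b16_384.npz')  # hypothetical local copy

# Print a few parameter names and shapes to see the checkpoint layout.
for name in sorted(params.files)[:5]:
    print(name, params[name].shape)
```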

### ImageNet 1k
| Model | Pretrain | resolution | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
70 changes: 70 additions & 0 deletions configs/vision_transformer/metafile.yml
@@ -0,0 +1,70 @@
Collections:
- Name: Vision Transformer
Metadata:
Architecture:
- Attention Dropout
- Convolution
- Dense Connections
- Dropout
- GELU
- Layer Normalization
- Multi-Head Attention
- Scaled Dot-Product Attention
- Tanh Activation
Paper:
URL: https://arxiv.org/pdf/2010.11929.pdf
Title: 'An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale'
README: configs/vision_transformer/README.md

Models:
- Name: vit-base-p16_in21k-pre-3rdparty_in1k-384
In Collection: Vision Transformer
Config: configs/vision_transformer/vit-base-p16_ft-evalonly_in-1k-384.py
Metadata:
FLOPs: 33030000000
Parameters: 86860000
Training Data: ImageNet
Results:
- Dataset: ImageNet
Task: Image Classification
Metrics:
Top 1 Accuracy: 85.43
Top 5 Accuracy: 97.77
Weights: https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-base-p16_in21k-pre-3rdparty_in1k-384_20210819-65c4bf44.pth
Converted From:
Weights: https://console.cloud.google.com/storage/browser/_details/vit_models/augreg/B_16-i21k-300ep-lr_0.001-aug_medium1-wd_0.1-do_0.0-sd_0.0--imagenet2012-steps_20k-lr_0.03-res_384.npz
Code: https://github.com/google-research/vision_transformer/blob/88a52f8892c80c10de99194990a517b4d80485fd/vit_jax/models.py#L208
- Name: vit-base-p32_in21k-pre-3rdparty_in1k-384
In Collection: Vision Transformer
Config: configs/vision_transformer/vit-base-p32_ft-evalonly_in-1k-384.py
Metadata:
FLOPs: 8560000000
Parameters: 88300000
Training Data: ImageNet
Results:
- Dataset: ImageNet
Task: Image Classification
Metrics:
Top 1 Accuracy: 84.01
Top 5 Accuracy: 97.08
Weights: https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-base-p32_in21k-pre-3rdparty_in1k-384_20210819-a56f8886.pth
Converted From:
Weights: https://console.cloud.google.com/storage/browser/_details/vit_models/augreg/B_32-i21k-300ep-lr_0.001-aug_light1-wd_0.1-do_0.0-sd_0.0--imagenet2012-steps_20k-lr_0.01-res_384.npz
Code: https://github.com/google-research/vision_transformer/blob/88a52f8892c80c10de99194990a517b4d80485fd/vit_jax/models.py#L208
- Name: vit-large-p16_in21k-pre-3rdparty_in1k-384
In Collection: Vision Transformer
Config: configs/vision_transformer/vit-large-p16_ft-evalonly_in-1k-384.py
Metadata:
FLOPs: 116680000000
Parameters: 304720000
Training Data: ImageNet
Results:
- Dataset: ImageNet
Task: Image Classification
Metrics:
Top 1 Accuracy: 85.63
Top 5 Accuracy: 97.63
Weights: https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-large-p16_in21k-pre-3rdparty_in1k-384_20210819-0bb8550c.pth
Converted From:
Weights: https://console.cloud.google.com/storage/browser/_details/vit_models/augreg/L_16-i21k-300ep-lr_0.001-aug_strong1-wd_0.1-do_0.0-sd_0.0--imagenet2012-steps_20k-lr_0.01-res_384.npz
Code: https://github.com/google-research/vision_transformer/blob/88a52f8892c80c10de99194990a517b4d80485fd/vit_jax/models.py#L208
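
As a quick sanity check, the new metafile parses like any other YAML file. A minimal sketch using PyYAML, run from the repository root:

```python
import yaml

# Load the metafile and print each model's name and ImageNet top-1 accuracy.
with open('configs/vision_transformer/metafile.yml') as f:
    metafile = yaml.safe_load(f)

for model in metafile['Models']:
    top1 = model['Results'][0]['Metrics']['Top 1 Accuracy']
    print(f"{model['Name']}: top-1 {top1}%")
```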
34 changes: 34 additions & 0 deletions configs/vision_transformer/vit-large-p32_ft-evalonly_in-1k-384.py
@@ -0,0 +1,34 @@
# Refer to pytorch-image-models
_base_ = [
'../_base_/models/vit-large-p32.py',
'../_base_/datasets/imagenet_bs32_pil_resize.py',
'../_base_/schedules/imagenet_bs256_epochstep.py',
'../_base_/default_runtime.py'
]

model = dict(backbone=dict(img_size=384))

img_norm_cfg = dict(
mean=[127.5, 127.5, 127.5], std=[127.5, 127.5, 127.5], to_rgb=True)
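# A mean/std of 127.5 maps pixel values from [0, 255] to roughly [-1, 1],
# matching the preprocessing used by the original JAX ViT models.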

train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='RandomResizedCrop', size=384, backend='pillow'),
dict(type='RandomFlip', flip_prob=0.5, direction='horizontal'),
dict(type='Normalize', **img_norm_cfg),
dict(type='ImageToTensor', keys=['img']),
dict(type='ToTensor', keys=['gt_label']),
dict(type='Collect', keys=['img', 'gt_label'])
]

test_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='Resize', size=(384, -1), backend='pillow'),
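    # The size=(384, -1) above resizes the short edge to 384 while keeping
    # the aspect ratio; the CenterCrop below then takes a square 384 crop.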
dict(type='CenterCrop', crop_size=384),
dict(type='Normalize', **img_norm_cfg),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
]

data = dict(
train=dict(pipeline=train_pipeline), test=dict(pipeline=test_pipeline))
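
For reference, an evaluation-only config like this one can be exercised directly from the Python API. A minimal sketch, assuming a compatible checkpoint has been downloaded locally (the checkpoint file name is illustrative):

```python
from mmcls.apis import inference_model, init_model

# Build the model from the config and load the converted checkpoint.
model = init_model(
    'configs/vision_transformer/vit-large-p32_ft-evalonly_in-1k-384.py',
    'vit-large-p32_in1k-384.pth',  # hypothetical local checkpoint path
    device='cuda:0')

# Single-image inference; the result contains the predicted class and score.
result = inference_model(model, 'demo/demo.JPEG')
print(result['pred_class'], result['pred_score'])
```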
17 changes: 8 additions & 9 deletions docs/model_zoo.md
@@ -20,10 +20,10 @@ The ResNet family models below are trained by standard data augmentations, i.e.,
| ResNet-50 | 25.56 | 4.12 | 76.55 | 93.15 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/resnet/resnet50_b32x8_imagenet.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_batch256_imagenet_20200708-cfb998bf.pth) | [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet50_batch256_imagenet_20200708-cfb998bf.log.json) |
| ResNet-101 | 44.55 | 7.85 | 78.18 | 94.03 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/resnet/resnet101_b32x8_imagenet.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet101_batch256_imagenet_20200708-753f3608.pth) | [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet101_batch256_imagenet_20200708-753f3608.log.json) |
| ResNet-152 | 60.19 | 11.58 | 78.63 | 94.16 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/resnet/resnet152_b32x8_imagenet.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet152_batch256_imagenet_20200708-ec25b1f9.pth) | [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnet152_batch256_imagenet_20200708-ec25b1f9.log.json) |
| ResNeSt-50\* | 27.48 | 5.41 | 81.13 | 95.59 | | [model](https://download.openmmlab.com/mmclassification/v0/resnest/resnest50_imagenet_converted-1ebf0afe.pth) | [log]() |
| ResNeSt-101\* | 48.28 | 10.27 | 82.32 | 96.24 | | [model](https://download.openmmlab.com/mmclassification/v0/resnest/resnest101_imagenet_converted-032caa52.pth) | [log]() |
| ResNeSt-200\* | 70.2 | 17.53 | 82.41 | 96.22 | | [model](https://download.openmmlab.com/mmclassification/v0/resnest/resnest200_imagenet_converted-581a60f2.pth) | [log]() |
| ResNeSt-269\* | 110.93 | 22.58 | 82.70 | 96.28 | | [model](https://download.openmmlab.com/mmclassification/v0/resnest/resnest269_imagenet_converted-59930960.pth) | [log]() |
| ResNetV1D-50 | 25.58 | 4.36 | 77.54 | 93.57 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/resnet/resnetv1d50_b32x8_imagenet.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d50_b32x8_imagenet_20210531-db14775a.pth) | [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d50_b32x8_imagenet_20210531-db14775a.log.json) |
| ResNetV1D-101 | 44.57 | 8.09 | 78.93 | 94.48 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/resnet/resnetv1d101_b32x8_imagenet.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d101_b32x8_imagenet_20210531-6e13bcd3.pth) | [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d101_b32x8_imagenet_20210531-6e13bcd3.log.json) |
| ResNetV1D-152 | 60.21 | 11.82 | 79.41 | 94.7 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/resnet/resnetv1d152_b32x8_imagenet.py) | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d152_b32x8_imagenet_20210531-278cf22a.pth) | [log](https://download.openmmlab.com/mmclassification/v0/resnet/resnetv1d152_b32x8_imagenet_20210531-278cf22a.log.json) |
@@ -36,14 +36,13 @@ The ResNet family models below are trained by standard data augmentations, i.e.,
| ShuffleNetV1 1.0x (group=3) | 1.87 | 0.146 | 68.13 | 87.81 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/shufflenet_v1/shufflenet_v1_1x_b64x16_linearlr_bn_nowd_imagenet.py) | [model](https://download.openmmlab.com/mmclassification/v0/shufflenet_v1/shufflenet_v1_batch1024_imagenet_20200804-5d6cec73.pth) | [log](https://download.openmmlab.com/mmclassification/v0/shufflenet_v1/shufflenet_v1_batch1024_imagenet_20200804-5d6cec73.log.json) |
| ShuffleNetV2 1.0x | 2.28 | 0.149 | 69.55 | 88.92 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/shufflenet_v2/shufflenet_v2_1x_b64x16_linearlr_bn_nowd_imagenet.py) | [model](https://download.openmmlab.com/mmclassification/v0/shufflenet_v2/shufflenet_v2_batch1024_imagenet_20200812-5bf4721e.pth) | [log](https://download.openmmlab.com/mmclassification/v0/shufflenet_v2/shufflenet_v2_batch1024_imagenet_20200804-8860eec9.log.json) |
| MobileNet V2 | 3.5 | 0.319 | 71.86 | 90.42 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/mobilenet_v2/mobilenet_v2_b32x8_imagenet.py) | [model](https://download.openmmlab.com/mmclassification/v0/mobilenet_v2/mobilenet_v2_batch256_imagenet_20200708-3b2dc3af.pth) | [log](https://download.openmmlab.com/mmclassification/v0/mobilenet_v2/mobilenet_v2_batch256_imagenet_20200708-3b2dc3af.log.json) |
| ViT-B/16\* | 86.86 | 33.03 | 85.43 | 97.77 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/vision_transformer/vit-base-p16_ft-evalonly_in-1k-384.py) | [model](https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-base-p16_in21k-pre-3rdparty_in1k-384_20210819-65c4bf44.pth) | [log]() |
| ViT-B/32\* | 88.3 | 8.56 | 84.01 | 97.08 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/vision_transformer/vit-base-p32_ft-evalonly_in-1k-384.py) | [model](https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-base-p32_in21k-pre-3rdparty_in1k-384_20210819-a56f8886.pth) | [log]() |
| ViT-L/16\* | 304.72 | 116.68 | 85.63 | 97.63 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/vision_transformer/vit-large-p16_ft-evalonly_in-1k-384.py) | [model](https://download.openmmlab.com/mmclassification/v0/vit/finetune/vit-large-p16_in21k-pre-3rdparty_in1k-384_20210819-0bb8550c.pth) | [log]() |
| Swin-Transformer tiny | 28.29 | 4.36 | 81.18 | 95.61 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/swin_transformer/swin_tiny_224_b16x64_300e_imagenet.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_tiny_224_b16x64_300e_imagenet_20210616_090925-66df6be6.pth) | [log](https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_tiny_224_b16x64_300e_imagenet_20210616_090925.log.json)|
| Swin-Transformer small| 49.61 | 8.52 | 83.02 | 96.29 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/swin_transformer/swin_small_224_b16x64_300e_imagenet.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_small_224_b16x64_300e_imagenet_20210615_110219-7f9d988b.pth) | [log](https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_small_224_b16x64_300e_imagenet_20210615_110219.log.json)|
| Swin-Transformer base | 87.77 | 15.14 | 83.36 | 96.44 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/swin_transformer/swin_base_224_b16x64_300e_imagenet.py) | [model](https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_base_224_b16x64_300e_imagenet_20210616_190742-93230b0d.pth) | [log](https://download.openmmlab.com/mmclassification/v0/swin-transformer/swin_base_224_b16x64_300e_imagenet_20210616_190742.log.json)|
| Transformer in Transformer small\* | 23.76 | 3.36 | 81.52 | 95.73 | [config](https://github.com/open-mmlab/mmclassification/blob/master/configs/tnt/tnt_s_patch16_224_evalonly_imagenet) | [model](http://download.openmmlab.com/mmclassification/v0/transformer-in-transformer/convert/tnt_s_patch16_224_evalonly_imagenet.pth) | [log]()|

Models with \* are converted from other repos, while the others are trained by ourselves.

1 change: 1 addition & 0 deletions model-index.yml
@@ -8,3 +8,4 @@ Import:
- configs/shufflenet_v2/metafile.yml
- configs/swin_transformer/metafile.yml
- configs/vgg/metafile.yml
- configs/vision_transformer/metafile.yml
