
Hyper-parameters of ViT-B/16 training from scratch #2

Closed · liuyuyuil opened this issue Oct 26, 2020 · 9 comments

@liuyuyuil

Thanks for sharing your code. Can you provide the hyper-parameters (e.g. learning rate, weight decay, optimizer type, training epochs) of ViT-B/16 training from scratch on ImageNet dataset? Many thanks.

@andsteing
Collaborator

andsteing commented Oct 26, 2020

Note that for the published checkpoints we pretrained on imagenet21k (see README), using ~~102.4M~~ 12.4M examples for training.

As for the hyperparameters:

batch_size=4096
lr.base=1e-3
lr.decay_type=linear
lr.linear_end=1e-5
lr.warmup_steps=10_000
dropout_rate=0.1
num_epochs=90
weight_decay=0.03
optimizer=Adam
representation_size=768

We used the same cropping code but an image size of 224 (thus a 14x14 patch grid).

The model was exactly the same, apart from the additional penultimate layer with dimensionality representation_size. The final classification layer's bias weights were initialized to -10.
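
For illustration, a minimal optax sketch of the schedule/optimizer described above (this is not the actual vit_jax training code: the step count is derived from 90 epochs over 12.4M examples at batch 4096, and adamw's decoupled weight decay is an assumption about how weight_decay was applied):

import optax

# Linear warmup to lr.base, then linear decay to lr.linear_end (sketch only).
base_lr, end_lr, warmup_steps = 1e-3, 1e-5, 10_000
total_steps = 90 * 12_400_000 // 4096  # ~272k steps: 90 epochs, 12.4M examples, batch 4096

schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(0.0, base_lr, warmup_steps),                   # warmup
        optax.linear_schedule(base_lr, end_lr, total_steps - warmup_steps),  # linear decay
    ],
    boundaries=[warmup_steps],
)

# Adam with weight_decay=0.03 (decoupled decay via adamw is an assumption here).
optimizer = optax.adamw(learning_rate=schedule, weight_decay=0.03)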

@liuyuyuil
Author

Thanks for the reply!

@liuyuyuil
Author

liuyuyuil commented Oct 26, 2020

By the way, what's the top-1 accuracy of ViT-B/16 trained from scratch on ImageNet with an image size of 224? There is a statement in the paper:

> With self-supervised pre-training, our smaller ViT-B/16 model achieves 79.9% accuracy on ImageNet, a significant improvement of 2% to training from scratch.

Is it 77.9%? Thanks.

@andsteing
Collaborator

The 79.9% refers to the self-supervised pretraining - see B.1.2. in the appendix for details. The B/16 model pre-trained and fine-tuned on imagenet2012 achieves 77.9% (see table 5 in the appendix).

@andsteing
Collaborator

That was a typo (now corrected) - it should have said 12.4M examples. See this comment for more details.

@cissoidx

> The 79.9% refers to the self-supervised pretraining - see B.1.2. in the appendix for details. The B/16 model pre-trained and fine-tuned on imagenet2012 achieves 77.9% (see table 5 in the appendix).

What is the top-1 accuracy of pre-training (without fine-tuning) on imagenet2012?

@andsteing
Collaborator

andsteing commented Aug 18, 2021

Top-1 accuracy (evaluated on a 50k holdout from the training set) at the end of pre-training for the original ViT paper was as follows:

| name | val_acc |
| --- | --- |
| ViT-B/32 i1k | 69.19% |
| ViT-B/16 i1k | 74.79% |
| ViT-L/32 i1k | 66.90% |
| ViT-L/16 i1k | 72.59% |

Note that we have much more detail about pre-training from scratch in the paper How to train your ViT?...; check out the database in our Colab:

https://colab.research.google.com/github/google-research/vision_transformer/blob/master/vit_jax_augreg.ipynb

For example, to show you the final pre-training top1 accuracy of a variety of models and pre-training settings:

import plotly.express as px

# `df` is the results DataFrame loaded earlier in the Colab linked above.
# Final pre-training validation accuracy vs. augmentation setting, one facet per model.
px.scatter(
    df.drop_duplicates('filename').query('ds=="i1k" and final_val>0.4'),
    y='final_val',   # final pre-training top-1 validation accuracy
    x='aug',         # data augmentation setting
    color='wd',      # weight decay
    symbol='do',     # dropout rate
    facet_col='name',
    facet_col_wrap=4,
)
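
(If you want to run the snippet outside the Colab, `df` is the checkpoint index; a minimal sketch, assuming the index CSV is still published at gs://vit_models/augreg/index.csv as in the Colab:)

import pandas as pd
import tensorflow as tf

# Assumed location of the AugReg checkpoint index (taken from the Colab above);
# adjust the path if the bucket layout differs.
with tf.io.gfile.GFile('gs://vit_models/augreg/index.csv') as f:
    df = pd.read_csv(f)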

@cissoidx

> Note that for the published checkpoints we pretrained on imagenet21k (see README), using ~~102.4M~~ 12.4M examples for training.
>
> As for the hyperparameters:
>
> batch_size=4096
> lr.base=1e-3
> lr.decay_type=linear
> lr.linear_end=1e-5
> lr.warmup_steps=10_000
> dropout_rate=0.1
> num_epochs=90
> weight_decay=0.03
> optimizer=Adam
> representation_size=768
>
> We used the same cropping code but an image size of 224 (thus a 14x14 patch grid).
>
> The model was exactly the same, apart from the additional penultimate layer with dimensionality representation_size. The final classification layer's bias weights were initialized to -10.

Hi, do you use the same hyperparameters when pre-training on imagenet1k?

@andsteing
Collaborator

For Imagenet1k pre-training we used the following hparams, which differ from those used for pre-training on Imagenet21k:

grad_clip_norm=1.0
lr.base=3e-3
lr.decay_type=cosine
dropout_rate=0.1  # B/16, B/32
dropout_rate=0.2  # L/16, L/32
num_epochs=300  # B/16, B/32
weight_decay=0.3

(Note that training L/16 and L/32 on i1k can be tricky; you might want to reduce the number of epochs or augment the data as described in the How to train your ViT? paper.)
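
Again only an illustrative optax sketch (not the repo's code), assuming the warmup and batch size are unchanged from the i21k settings above and 300 epochs over the ~1.28M ImageNet-1k training images:

import optax

# Cosine decay with linear warmup, plus global-norm gradient clipping (sketch only).
base_lr, warmup_steps = 3e-3, 10_000   # warmup assumed same as for i21k above
total_steps = 300 * 1_281_167 // 4096  # ~94k steps: 300 epochs (B/16, B/32) at batch 4096

schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=base_lr,
    warmup_steps=warmup_steps,
    decay_steps=total_steps,
)

optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),                         # grad_clip_norm=1.0
    optax.adamw(learning_rate=schedule, weight_decay=0.3),  # weight_decay=0.3 (decoupled here)
)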
