
train ViT-b16 from scratch on Imagenet #153

Open · justHungryMan opened this issue Nov 22, 2021 · 8 comments

justHungryMan commented Nov 22, 2021

Thanks for your work and the detailed answers in the issues.
I am reproducing ViT-B/16 in TensorFlow based on your paper and answers. (In this issue I only deal with the original ViT paper.)
But I only reached about 47% on the ImageNet-1k validation set (upstream).
I want to know whether my experimental conditions are incorrect, and I hope this issue helps others reproduce ViT from scratch.

  • I used inception_crop as shown here and random_flip for the train step, and just a resize for the val step (crop_size: 224, resize_image_size: 224)
  • Input data normalized to [-1.0, 1.0]
  • No RandAugment, no mixup in upstream (they were used in "How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers")
  • Sigmoid cross entropy (sigmoid_xent in your answer) with label_smoothing=0.0001
  • Batch_size: 4096 / 2 = 2048 (memory limits on TPU v3-8), Learning_rate: 3e-3 / 2 (half batch size, half lr), adam: (0.9, 0.99), weight_decay: 0.3, global_clipnorm=1.0, epochs=300 (a rough sketch of this setup is below)
  • representation_size = 768, dropout=0.1, attn_drop=0.0, no DropPath
  • CosineDecay with 20k warmup steps (10k * 2: half batch size, double warmup)
  • I used bfloat16
  • The final classification layer's bias was initialized to -10
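
A minimal TensorFlow sketch of the optimizer, schedule, and loss described in the list above (assuming TF >= 2.11 so that `tf.keras.optimizers.AdamW` is available; the `WarmupCosine` class and the step counts are illustrative, not taken from the actual training code):

```python
import math

import tensorflow as tf


class WarmupCosine(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Linear warmup followed by cosine decay to zero."""

    def __init__(self, base_lr, warmup_steps, total_steps):
        super().__init__()
        self.base_lr = base_lr
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        warmup_lr = self.base_lr * step / self.warmup_steps
        progress = (step - self.warmup_steps) / (self.total_steps - self.warmup_steps)
        cosine_lr = self.base_lr * 0.5 * (1.0 + tf.cos(math.pi * progress))
        return tf.where(step < self.warmup_steps, warmup_lr, cosine_lr)


# Roughly 300 epochs of ImageNet-1k at batch size 2048, with the doubled 20k warmup.
schedule = WarmupCosine(base_lr=1.5e-3, warmup_steps=20_000, total_steps=187_500)

# Adam with decoupled weight decay, the betas from the list above
# (note: andsteing recommends beta2=0.999 later in this thread),
# and global gradient-norm clipping at 1.0.
optimizer = tf.keras.optimizers.AdamW(
    learning_rate=schedule,
    weight_decay=0.3,
    beta_1=0.9,
    beta_2=0.99,
    global_clipnorm=1.0,
)

# "sigmoid_xent": independent sigmoid cross entropy per class on one-hot
# labels, with a small amount of label smoothing.
loss_fn = tf.keras.losses.BinaryCrossentropy(
    from_logits=True, label_smoothing=1e-4)
```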

Here are my train loss, val loss, train acc, val acc, and LR curves:

[plots: train_loss, val_loss, train_acc, val_acc, lr]

I don't know if you have time to look at the code, but here is my code.

andsteing (Collaborator) commented

Hi Sangjun

A couple of differences:

  1. Even with the smaller batch size, I would keep the same lr=3e-3 (when using Adam).
  2. Speaking of Adam, we used the following parameters: beta1=.9, beta2=.999.
  3. We used float32 throughout for the pre-training.
  4. For evaluation, we first resized to 256px (smaller side, keeping the original aspect ratio), then took a 224px central crop (see the sketch after this list).
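
A minimal tf.image sketch of this evaluation preprocessing (resize so the smaller side is 256px while keeping the aspect ratio, then take a 224px central crop); the function name and the [-1, 1] normalization at the end are assumptions chosen to match the setup discussed above, not code from the repository:

```python
import tensorflow as tf


def eval_preprocess(image, resize_smaller=256, crop_size=224):
    """Resize so the smaller side is `resize_smaller`, then central-crop."""
    shape = tf.cast(tf.shape(image)[:2], tf.float32)
    scale = tf.cast(resize_smaller, tf.float32) / tf.reduce_min(shape)
    new_size = tf.cast(tf.round(shape * scale), tf.int32)
    image = tf.image.resize(image, new_size)  # returns float32

    # Central crop, e.g. 768x512 -> 384x256 -> 224x224.
    offset_h = (new_size[0] - crop_size) // 2
    offset_w = (new_size[1] - crop_size) // 2
    image = tf.image.crop_to_bounding_box(
        image, offset_h, offset_w, crop_size, crop_size)

    # Normalize to [-1, 1] as in the training setup above.
    return image / 127.5 - 1.0
```

Usage would look like `ds_val = ds_val.map(lambda x, y: (eval_preprocess(x), y))`, with the normalization matched to whatever the train pipeline uses.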

Final validation top-1 accuracy was 74.79%, and the training curves looked as follows:
[training curves]

HTH, Andreas

justHungryMan (Author) commented Nov 22, 2021

@andsteing Thank you for your answer

  1. Why keep the same lr when the batch size is different?
  2. I know you are using TPUs, but is there a reason why you used float32 instead of bfloat16? (Accuracy in the attention?)
  3. OK, so I keep the inception crop for the train set, but for the val set I resize so the smaller side is 256 (for example 768x512 -> 384x256) and then take a 224 center crop.

I'll be back after training for 3 days... :)

andsteing (Collaborator) commented

  1. When pre-training vision transformers we found that the optimal learning rate was pretty stable across different batch sizes. You could try different learning rates to verify, but I fear there's no simple relationship between learning rate and batch size when pre-training vision transformers (see e.g. http://arxiv.org/abs/1811.03600 for an empirical study of this subject).
  2. In general, going from full to half precision lowers quality and can lead to instabilities. We would usually do everything in float32, and only later try running individual parts of the parameters and/or optimizer in bfloat16 (a sketch of this appears after this list). But lowering the precision while keeping the quality constant generally requires some trial and error, and it's not obvious what will work and what won't.
  3. That's correct.
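
Regarding point 2, one way to lower precision selectively in TensorFlow is the Keras mixed-precision policy, which keeps variables (and optimizer slots) in float32 while running most layer math in bfloat16. This is only a sketch of the idea, not what was used for the released models, and whether it matches full-float32 quality still needs to be verified experimentally:

```python
import tensorflow as tf

# Keep variables in float32 while running most layer computations in
# bfloat16; this lowers precision only for the compute.
tf.keras.mixed_precision.set_global_policy('mixed_bfloat16')

# Numerically sensitive parts can be pinned back to float32 explicitly,
# e.g. the classification head so the logits stay in full precision.
head = tf.keras.layers.Dense(1000, dtype='float32', name='head')
```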

justHungryMan (Author) commented

Hi @andsteing.
Thanks to you, I was able to run the experiment well.
For users who want to train from scratch, I opened a repo (https://github.com/justHungryMan/vision-transformer-tf). (Please let me know if this is a problem.)

I've discovered a few things while doing the experiment, and I'd like to hear your opinions.

  1. Higher val accuracy upstream does not necessarily lead to higher accuracy downstream. When I trained B/16 with lr=0.003 it reached 74.37% upstream (ImageNet-1k, bs=1024; call it A), and lr=0.00075 reached 74.52% upstream (call it B), but A reached 74.896% downstream while B reached only 72.467%. It is interesting, but hard to understand why.

  2. When finetuning, the public code only uses "resize" for the val set, but resizing to 416 and center cropping to 384 gives higher accuracy (74.896% -> 75.945%). May I ask why you use only "resize" for the val set?

yzlnew commented Feb 8, 2023

Hi @andsteing, I'm trying to train ViT-B/16 from scratch on ImageNet using the SAM optimizer. Could you share your training details?
From the paper, I believe it uses the same base optimizer with rho=0.2. However, I can't get anywhere near 79.9%.

andsteing (Collaborator) commented

cc @xiangning-chen, who added the SAM checkpoints in #119.

xiangning-chen (Contributor) commented Feb 9, 2023

Hi @yzlnew, how many machines are you using to train the model? This essentially determines the degree to which SAM is distributed, which corresponds to the m-sharpness discussed in section 4.1 here. For my experiments I used 64 TPU chips. If you are using fewer machines, my experience is to enlarge rho.
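
For anyone following along, here is a minimal sketch of one SAM update step in TensorFlow. When each replica computes its perturbation from its own local gradient like this (no cross-replica averaging before the ascent step), the per-replica batch size plays the role of m in m-sharpness, which is why fewer machines (larger per-replica batches) may call for a larger rho. This is only an illustration of the idea, not the code used for the released checkpoints:

```python
import tensorflow as tf


def sam_train_step(model, optimizer, loss_fn, images, labels, rho=0.2):
    """One SAM step: ascend to nearby worst-case weights, then descend."""
    trainable = model.trainable_variables

    # 1) Gradient at the current weights (per replica, not synchronized).
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(images, training=True))
    grads = tape.gradient(loss, trainable)

    # 2) Perturb the weights by rho * g / ||g|| (the "ascent" step).
    grad_norm = tf.linalg.global_norm(grads) + 1e-12
    epsilons = [rho * g / grad_norm for g in grads]
    for v, e in zip(trainable, epsilons):
        v.assign_add(e)

    # 3) Gradient at the perturbed weights, then undo the perturbation.
    with tf.GradientTape() as tape:
        perturbed_loss = loss_fn(labels, model(images, training=True))
    sam_grads = tape.gradient(perturbed_loss, trainable)
    for v, e in zip(trainable, epsilons):
        v.assign_sub(e)

    # 4) Apply the gradient from the perturbed point with the base optimizer.
    optimizer.apply_gradients(zip(sam_grads, trainable))
    return loss
```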

yzlnew commented Feb 10, 2023

@xiangning-chen Thanks for the clarification! I have tried training on 4/8/32 A100 cards with rho=0.2. I have also noticed that a larger rho can improve performance in other experiments.
