
train ViT-b16 from scratch on Imagenet #153

Open · justHungryMan opened this issue Nov 22, 2021 · 8 comments

justHungryMan commented Nov 22, 2021

Thanks for your work and the detailed answers in the issues.
I am reproducing ViT-B/16 in TensorFlow based on your paper and answers. (In this issue I only deal with the original ViT paper.)
But I only reached about 47% on the ImageNet-1k validation set (upstream).
I want to know whether my experimental conditions are incorrect, and I hope this issue helps others reproduce ViT from scratch.

  • I used inception_crop as shown here and random_flip for the train step, and just a resize for the val step (crop_size: 224, resize_image_size: 224)
  • Input data normalized to [-1.0, 1.0]
  • No RandAugment, no mixup in upstream (they were used in "How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers")
  • Sigmoid cross entropy (sigmoid_xent in your answer) with label_smoothing=0.0001
  • Batch_size: 4096 / 2 = 2048 (memory limits on TPU v3-8), Learning_rate: 3e-3 / 2 (half batch size, half lr), adam: (0.9, 0.99), weight_decay: 0.3, global_clipnorm=1.0, epochs=300 (a rough sketch of this setup is below)
  • representation_size = 768, dropout=0.1, attn_drop=0.0, no DropPath
  • CosineDecay with 20k warmup steps (10k * 2: half batch size, double warmup)
  • I used bfloat16
  • The final classification layer's bias was initialized to -10
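
A minimal TensorFlow sketch of the optimizer, schedule, and loss described in the list above (assuming TF >= 2.11 so that `tf.keras.optimizers.AdamW` is available; the `WarmupCosine` class and the step counts are illustrative, not taken from the actual training code):

```python
import math

import tensorflow as tf


class WarmupCosine(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Linear warmup followed by cosine decay to zero."""

    def __init__(self, base_lr, warmup_steps, total_steps):
        super().__init__()
        self.base_lr = base_lr
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        warmup_lr = self.base_lr * step / self.warmup_steps
        progress = (step - self.warmup_steps) / (self.total_steps - self.warmup_steps)
        cosine_lr = self.base_lr * 0.5 * (1.0 + tf.cos(math.pi * progress))
        return tf.where(step < self.warmup_steps, warmup_lr, cosine_lr)


# Roughly 300 epochs of ImageNet-1k at batch size 2048, with the doubled 20k warmup.
schedule = WarmupCosine(base_lr=1.5e-3, warmup_steps=20_000, total_steps=187_500)

# Adam with decoupled weight decay, the betas from the list above
# (note: andsteing recommends beta2=0.999 later in this thread),
# and global gradient-norm clipping at 1.0.
optimizer = tf.keras.optimizers.AdamW(
    learning_rate=schedule,
    weight_decay=0.3,
    beta_1=0.9,
    beta_2=0.99,
    global_clipnorm=1.0,
)

# "sigmoid_xent": independent sigmoid cross entropy per class on one-hot
# labels, with a small amount of label smoothing.
loss_fn = tf.keras.losses.BinaryCrossentropy(
    from_logits=True, label_smoothing=1e-4)
```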

Here are my train loss, val loss, train acc, val acc, and LR curves:

[plots: train_loss, val_loss, train_acc, val_acc, lr]

I don't know if you have time to look at the code, but here is my code.

andsteing (Collaborator) commented

Hi Sangjun

A couple of differences:

  1. Even with the smaller batch size, I would keep the same lr=3e-3 (when using Adam).
  2. Speaking of Adam, we used the following parameters: beta1=.9, beta2=.999.
  3. We used float32 throughout for the pre-training.
  4. For evaluation, we first resized to 256px (smaller side, keeping the original aspect ratio), then took a 224px central crop (see the sketch after this list).
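
A minimal tf.image sketch of this evaluation preprocessing (resize so the smaller side is 256px while keeping the aspect ratio, then take a 224px central crop); the function name and the [-1, 1] normalization at the end are assumptions chosen to match the setup discussed above, not code from the repository:

```python
import tensorflow as tf


def eval_preprocess(image, resize_smaller=256, crop_size=224):
    """Resize so the smaller side is `resize_smaller`, then central-crop."""
    shape = tf.cast(tf.shape(image)[:2], tf.float32)
    scale = tf.cast(resize_smaller, tf.float32) / tf.reduce_min(shape)
    new_size = tf.cast(tf.round(shape * scale), tf.int32)
    image = tf.image.resize(image, new_size)  # returns float32

    # Central crop, e.g. 768x512 -> 384x256 -> 224x224.
    offset_h = (new_size[0] - crop_size) // 2
    offset_w = (new_size[1] - crop_size) // 2
    image = tf.image.crop_to_bounding_box(
        image, offset_h, offset_w, crop_size, crop_size)

    # Normalize to [-1, 1] as in the training setup above.
    return image / 127.5 - 1.0
```

Usage would look like `ds_val = ds_val.map(lambda x, y: (eval_preprocess(x), y))`, with the normalization matched to whatever the train pipeline uses.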

Final validation top-1 accuracy was 74.79%, and the training curves looked as follows:
[training curves]

HTH, Andreas

justHungryMan (Author) commented Nov 22, 2021

@andsteing Thank you for your answer

  1. Why keep the same lr when the batch size is different?
  2. I know you are using TPUs, but is there a reason why you used float32 instead of bfloat16? (Accuracy in the attention?)
  3. OK, so I keep the inception crop for the train set, but for the val set I resize so the smaller side is 256 (for example 768x512 -> 384x256) and then take a 224 center crop.

I'll be back after training for 3 days... :)

andsteing (Collaborator) commented

  1. When pre-training vision transformers we found that the optimal learning rate was pretty stable across different batch sizes. You could try different learning rates to verify, but I fear there's no simple relationship between learning rate and batch size when pre-training vision transformers (see e.g. http://arxiv.org/abs/1811.03600 for an empirical study of this subject).
  2. In general, going from full to half precision lowers quality and can lead to instabilities. We would usually do everything in float32, and only later try running individual parts of the parameters and/or optimizer in bfloat16 (a sketch of this appears after this list). But lowering the precision while keeping the quality constant generally requires some trial and error, and it's not obvious what will work and what won't.
  3. That's correct.
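
Regarding point 2, one way to lower precision selectively in TensorFlow is the Keras mixed-precision policy, which keeps variables (and optimizer slots) in float32 while running most layer math in bfloat16. This is only a sketch of the idea, not what was used for the released models, and whether it matches full-float32 quality still needs to be verified experimentally:

```python
import tensorflow as tf

# Keep variables in float32 while running most layer computations in
# bfloat16; this lowers precision only for the compute.
tf.keras.mixed_precision.set_global_policy('mixed_bfloat16')

# Numerically sensitive parts can be pinned back to float32 explicitly,
# e.g. the classification head so the logits stay in full precision.
head = tf.keras.layers.Dense(1000, dtype='float32', name='head')
```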

justHungryMan (Author) commented

Hi @andsteing.
Thanks to you, I was able to run the experiment well.
For users who want to train from scratch, I opened a repo (https://github.com/justHungryMan/vision-transformer-tf). (Please let me know if this is a problem.)

I've discovered a few things while doing the experiment, and I'd like to hear your opinions.

  1. Higher val accuracy upstream does not necessarily lead to higher accuracy downstream. When I trained B/16 with lr=0.003 it reached 74.37% upstream (ImageNet-1k, bs=1024; call it A), and lr=0.00075 reached 74.52% upstream (call it B), but A reached 74.896% downstream while B reached only 72.467%. It is interesting, but hard to understand why.

  2. When finetuning, the public code only uses "resize" for the val set, but resizing to 416 and center cropping to 384 gives higher accuracy (74.896% -> 75.945%). May I ask why you use only "resize" for the val set?

yzlnew commented Feb 8, 2023

Hi @andsteing, I'm trying to train ViT-B/16 from scratch on ImageNet using the SAM optimizer. Could you share your training details?
From the paper, I believe it uses the same base optimizer with rho=0.2. However, I can't get anywhere near 79.9%.

andsteing (Collaborator) commented

cc @xiangning-chen, who added the SAM checkpoints in #119.

xiangning-chen (Contributor) commented Feb 9, 2023

Hi @yzlnew, how many machines are you using to train the model? This essentially determines the degree to which SAM is distributed, which corresponds to the m-sharpness discussed in section 4.1 here. For my experiments I used 64 TPU chips. If you are using fewer machines, my experience is to enlarge rho.
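
For anyone following along, here is a minimal sketch of one SAM update step in TensorFlow. When each replica computes its perturbation from its own local gradient like this (no cross-replica averaging before the ascent step), the per-replica batch size plays the role of m in m-sharpness, which is why fewer machines (larger per-replica batches) may call for a larger rho. This is only an illustration of the idea, not the code used for the released checkpoints:

```python
import tensorflow as tf


def sam_train_step(model, optimizer, loss_fn, images, labels, rho=0.2):
    """One SAM step: ascend to nearby worst-case weights, then descend."""
    trainable = model.trainable_variables

    # 1) Gradient at the current weights (per replica, not synchronized).
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(images, training=True))
    grads = tape.gradient(loss, trainable)

    # 2) Perturb the weights by rho * g / ||g|| (the "ascent" step).
    grad_norm = tf.linalg.global_norm(grads) + 1e-12
    epsilons = [rho * g / grad_norm for g in grads]
    for v, e in zip(trainable, epsilons):
        v.assign_add(e)

    # 3) Gradient at the perturbed weights, then undo the perturbation.
    with tf.GradientTape() as tape:
        perturbed_loss = loss_fn(labels, model(images, training=True))
    sam_grads = tape.gradient(perturbed_loss, trainable)
    for v, e in zip(trainable, epsilons):
        v.assign_sub(e)

    # 4) Apply the gradient from the perturbed point with the base optimizer.
    optimizer.apply_gradients(zip(sam_grads, trainable))
    return loss
```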

yzlnew commented Feb 10, 2023

@xiangning-chen Thanks for the clarification! I have tried training on 4/8/32 A100 cards with rho=0.2. I have also noticed that a larger rho can improve performance in other experiments.
