Inquiry on pretraining acc #62

Open
yyou1996 opened this issue Jan 30, 2021 · 22 comments

@yyou1996

Thanks for your excellent work. Would you mind me asking what the pretraining accuracy on imagenet2012 is, i.e. for the checkpoint that is then used for finetuning?

@andsteing (Collaborator)

Note that the published checkpoints were pretrained on imagenet21k. Pretraining accuracy on the validation set at the end of pretraining was:

| name | val_prec_1 |
| --- | --- |
| ViT-B_16 | 47.88% |
| ViT-B_32 | 44.04% |
| ViT-L_16 | 49.90% |
| ViT-L_32 | 45.42% |
| ViT-H_14 | 49.06% |

@cissoidx commented Jul 29, 2021

@andsteing Hello, do you by any chance have the pretraining accuracy on imagenet1k? I mean pretraining from absolute scratch, not finetuning.
I reached 48.4% using ViT-B_16 on the imagenet1k validation set and would like to have a reference if you have one.

@andsteing (Collaborator) commented Jul 29, 2021

Sure, results after 300 epochs (edit: L/32 and L/16 were trained for 90 epochs) of training on i1k from scratch are below:

| name | val_prec_1 |
| --- | --- |
| ViT-B/32 i1k | 69.19% |
| ViT-B/16 i1k | 74.79% |
| ViT-L/32 i1k | 66.90% |
| ViT-L/16 i1k | 72.59% |

@cissoidx

Thanks. I trained for far fewer epochs; will try again.

@andsteing (Collaborator)

For training ViT from scratch, you'll find that data augmentation and model regularization really help with medium-sized datasets (such as ImageNet and ImageNet-21k), albeit with even longer training schedules (1000 epochs for ImageNet, and 300 epochs for ImageNet-21k).
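
As a concrete illustration of one such regularization, here is a minimal mixup sketch in JAX; the `mixup` helper and `alpha=0.2` below are illustrative assumptions, not the exact AugReg recipe used in the papers.

```python
# Minimal mixup sketch in JAX (illustrative assumption, not the exact AugReg recipe).
import jax
import jax.numpy as jnp


def mixup(key, images, onehot_labels, alpha=0.2):
    """Mixes each example with a randomly chosen partner from the same batch."""
    key_lam, key_perm = jax.random.split(key)
    lam = jax.random.beta(key_lam, alpha, alpha)              # mixing coefficient in (0, 1)
    perm = jax.random.permutation(key_perm, images.shape[0])  # partner indices
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    mixed_labels = lam * onehot_labels + (1.0 - lam) * onehot_labels[perm]
    return mixed_images, mixed_labels
```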

@cissoidx

Thanks for the info.

@cissoidx

@andsteing What loss function do you use for pretraining? Is it the same one you use for finetuning? I have seen people use a semantic loss; is that necessary to reach SOTA?

@andsteing (Collaborator)

We used sigmoid cross-entropy during pre-training (and softmax cross-entropy for fine-tuning).
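
For reference, here is a minimal JAX sketch of the two losses with one-hot labels (my own illustration, not the repo's actual training code):

```python
# Sketch of the two losses for C-class classification with one-hot labels.
# Illustration only; not the exact loss code behind the numbers in this thread.
import jax.numpy as jnp
from jax import nn


def softmax_xent(logits, onehot):
    # Standard multi-class cross-entropy (used for fine-tuning).
    logp = nn.log_softmax(logits, axis=-1)
    return -jnp.sum(onehot * logp, axis=-1).mean()


def sigmoid_xent(logits, onehot):
    # Treats every class as an independent binary problem (used for pre-training).
    logp = nn.log_sigmoid(logits)
    log1mp = nn.log_sigmoid(-logits)   # log(1 - sigmoid(x)) == log_sigmoid(-x)
    return -jnp.sum(onehot * logp + (1.0 - onehot) * log1mp, axis=-1).mean()
```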

@cissoidx

Hi @andsteing,

AFAIK, we normally use sigmoid CE loss in multi-label tasks, since we assume the labels are independent, and softmax CE loss in single-label tasks, since we are looking for the max class. The two are actually the same in binary classification.

Pretraining is not a multi-label task, so why do you use sigmoid CE loss?
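
As a quick illustration of that binary-case equivalence: with two classes and logits `[z, 0]`, the softmax probability of the first class equals `sigmoid(z)`, so the two losses coincide (my own check, not from the repo).

```python
# Binary case: softmax over [z, 0] gives the same class probability as sigmoid(z).
import jax.numpy as jnp
from jax import nn

z = 1.7                                          # arbitrary example logit
p_softmax = nn.softmax(jnp.array([z, 0.0]))[0]   # ~0.8455
p_sigmoid = nn.sigmoid(jnp.array(z))             # ~0.8455
print(p_softmax, p_sigmoid)
```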

@cissoidx

I found that if I use softmax CE loss, the loss curve decreases quickly, but if I use sigmoid CE loss, the loss curve barely decreases. Is that normal?

@andsteing (Collaborator)

We experimented with both softmax CE and sigmoid CE, and found that sigmoid CE works better even with single-label i1k; see also the "Are we done with ImageNet?" paper for similar results.

As for the training loss, we observed the following evolution:

[image: training loss curve over the course of pre-training]

@cissoidx

Thanks for replying.

@cissoidx commented Sep 6, 2021

@andsteing Are the accuracies you mentioned obtained at 224×224 resolution?

@cissoidx commented Sep 8, 2021

@andsteing Can you please confirm that these pretraining accuracies are obtained at 224×224 resolution?

@andsteing (Collaborator)

(just came back from holiday)

Yes, the i1k pre-training accuracies above are indeed for 224×224 resolution. We only changed the resolution for fine-tuning runs.

@cissoidx

@andsteing Thanks for your help. I finally reached 75.5% validation accuracy when pretraining on in1k with B/16, even without some of the tricks mentioned in your papers, such as stochastic depth (I do not use it), a linear schedule (I used cosine), Adam (I used SGD), and gradient norm clipping (I do not use it). I just wonder whether there is an official statement of the accuracy you mentioned above. How should I cite your work properly?
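
For anyone reproducing this setup, here is a minimal optax sketch of a cosine schedule with warmup plus SGD momentum; every hyperparameter value below (peak learning rate, warmup steps, batch size, momentum) is an illustrative assumption, not the configuration behind the 75.5% above.

```python
# Sketch: cosine learning-rate schedule with linear warmup + SGD momentum in optax.
# All hyperparameter values are illustrative assumptions.
import optax

batch_size = 4096
total_steps = 300 * 1_281_167 // batch_size      # ~300 epochs of ImageNet-1k
warmup_steps = 10_000

schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=3e-3,
    warmup_steps=warmup_steps,
    decay_steps=total_steps,
    end_value=0.0,
)
tx = optax.sgd(learning_rate=schedule, momentum=0.9)
```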

@cissoidx

@andsteing In the paper "How to train your ViT", Figure 4, left plot, ViT on ImageNet-1k at 300 epochs reaches 83%. Your comment above does not match these numbers. I might have missed something; can you please clarify?

> Sure, results after 300 epochs (edit: L/32 and L/16 were trained for 90 epochs) of training on i1k from scratch are below:
>
> | name | val_prec_1 |
> | --- | --- |
> | ViT-B/32 i1k | 69.19% |
> | ViT-B/16 i1k | 74.79% |
> | ViT-L/32 i1k | 66.90% |
> | ViT-L/16 i1k | 72.59% |

@cissoidx

[screenshot of Figure 4, left plot, from "How to train your ViT"]

This is the result I am referring to. Looking forward to your reply.

@andsteing (Collaborator)

Hi @cissoidx

This thread started on January 30th 2021 and is about the i1k from-scratch training in the original ViT paper. The paper "How to train your ViT" applies additional AugReg to improve those numbers, but it was only published in June 2021, so I thought it would not apply to the original question (and mixing numbers from two different papers could make the thread more confusing).

You can find all the data about the pre-training and fine-tuning for "How to train your ViT" in this Colab:
https://colab.research.google.com/github/google-research/vision_transformer/blob/master/vit_jax_augreg.ipynb

Best, Andreas

@justHungryMan

Hi @cissoidx,
I'm training ViT-B/16 from scratch on imagenet1k now.
I only get 47.6% val accuracy, and you also got 48.4% (#62 (comment)).
Can you tell me how you improved the accuracy?
All my parameters are the same as in the ViT paper.

@cissoidx

@justHungryMan I guess it is not possible to reach the paper's SOTA with the default hyperparameters. Since they do not release the training code, you have to tune the hyperparameters yourself. Some suggestions: use ImageNet augmentation (as provided by the RandAugment package), and weight decay = 0.004.
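
One way the suggested weight decay of 0.004 could be wired into an optax optimizer (a sketch under my own assumptions; the L2-style placement of the decay term and the learning rate are illustrative):

```python
# Sketch: weight decay = 0.004 combined with SGD momentum in optax.
# The decay is added to the gradients before the SGD update (L2-style);
# the learning rate is an illustrative assumption.
import optax

weight_decay = 4e-3
tx = optax.chain(
    optax.add_decayed_weights(weight_decay),     # adds wd * params to the updates
    optax.sgd(learning_rate=0.1, momentum=0.9),
)
```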

@andsteing (Collaborator)

See also the discussion in #153.
