Huge discrepancy between HuggingFace and Timm for ViT and other vision transformers #19305

Phuoc-Hoan-Le · 2022-10-03T20:56:31Z

Feature request

Differences between HugginFace and Timm implementation of Vision Transformers can be listed as below:
-Missing stochastic depth (https://arxiv.org/abs/2012.12877)
-Using m.weight.data.normal_(mean=0.0, std=0.02) instead of trunc_normal_()
-Missing trunc_normal_() init for the position embedding and cls_token

My DeiT started training properly once I started using the trunc_normal_() init and stochastic depth for my huggingface ViT model. Also, I remove the pruning head functionality and I no longer inherit the HuggingFace ViT model class from the "PretrainedModel" class, but I'm not actually sure if this actually caused my training to work properly.

Motivation

These things could mean the difference between getting Nan or not during training for DeiT using the process from https://arxiv.org/abs/2012.12877

Your contribution

Would love to share my code but I can't. I refer you to read the code (https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/vision_transformer.py)

LysandreJik · 2022-10-03T21:19:24Z

cc @NielsRogge @amyeroberts @alaradirik

alaradirik · 2022-10-05T10:12:41Z

@CharlesLeeeee thank you for bringing this up! We are aware of the discrepancy and aim to rectify it soon. We will fix the parameter initialization issue shortly and open a separate PR to add stochastic depth.

cc @LysandreJik @NielsRogge @amyeroberts

Phuoc-Hoan-Le · 2022-10-05T23:40:07Z

I believe setting eps in layernorm to 1e-6 rather than 1e-12 is also important.

rwightman · 2022-10-14T18:45:29Z

FWIW there is an issue related to this on the timm side as well huggingface/pytorch-image-models#1477

As per my comments, the init issue should be minor / non consequential as it would not result in a significant difference given that std == .02. I've trained from scratch with much more significantly different inits and the end results aren't too far off.

The layer norm eps is likely an issue though, that was not mentioned on the timm side. For float16, 0 + 1e-12 = 0, not so for 1e-6 or 1e-5, which are defaults for all vision models I'm aware of that use LN. It looks like there are possibly other models that incorrectly use 1e-12 such as convnext? This could cause stability issues at reduced precision and will change the validation results for weights pretrained with 1e-5 or 1e-6. Generally 1e-12 should only be used as eps if you're sticking with float32 (or all uses of that eps are guaranteed to be upcast to float32).

Phuoc-Hoan-Le · 2022-11-01T20:34:25Z

Kaiming initialization should be used for the nn.Conv2d rather than .normal_() initialization in the class ViTPreTrainedModel or any class that directly inherits from PretrainedModel. And the biases of the nn.Conv2d in ViT should be initialized the same way as PyTorch. (https://pytorch.org/docs/stable/_modules/torch/nn/modules/conv.html#Conv2d) @LysandreJik @NielsRogge @amyeroberts @alaradirik

github-actions · 2022-11-26T15:02:24Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

NielsRogge · 2022-12-09T11:15:49Z

@CharlesLeeeee you are partially right, it seems that ViT uses PyTorch's default initialization scheme for nn.conv2d, at least in timm. The JAX init however uses a LeCun normal as seen here.

I'm working on this in #19449

alaradirik self-assigned this Oct 5, 2022

alaradirik mentioned this issue Oct 5, 2022

🚨 🚨 🚨 Fix ViT parameter initialization #19341

Merged

3 tasks

NielsRogge added the Core: Modeling Internals of the library; Models. label Nov 29, 2022

github-actions bot closed this as completed Dec 7, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Huge discrepancy between HuggingFace and Timm for ViT and other vision transformers #19305

Huge discrepancy between HuggingFace and Timm for ViT and other vision transformers #19305

Phuoc-Hoan-Le commented Oct 3, 2022 •

edited

Loading

LysandreJik commented Oct 3, 2022

alaradirik commented Oct 5, 2022 •

edited

Loading

Phuoc-Hoan-Le commented Oct 5, 2022

rwightman commented Oct 14, 2022 •

edited

Loading

Phuoc-Hoan-Le commented Nov 1, 2022

github-actions bot commented Nov 26, 2022

NielsRogge commented Dec 9, 2022 •

edited

Loading

Huge discrepancy between HuggingFace and Timm for ViT and other vision transformers #19305

Huge discrepancy between HuggingFace and Timm for ViT and other vision transformers #19305

Comments

Phuoc-Hoan-Le commented Oct 3, 2022 • edited Loading

Feature request

Motivation

Your contribution

LysandreJik commented Oct 3, 2022

alaradirik commented Oct 5, 2022 • edited Loading

Phuoc-Hoan-Le commented Oct 5, 2022

rwightman commented Oct 14, 2022 • edited Loading

Phuoc-Hoan-Le commented Nov 1, 2022

github-actions bot commented Nov 26, 2022

NielsRogge commented Dec 9, 2022 • edited Loading

Phuoc-Hoan-Le commented Oct 3, 2022 •

edited

Loading

alaradirik commented Oct 5, 2022 •

edited

Loading

rwightman commented Oct 14, 2022 •

edited

Loading

NielsRogge commented Dec 9, 2022 •

edited

Loading