🚨 🚨Bring some dinos to modern standards by molbap · Pull Request #46266 · huggingface/transformers

molbap · 2026-05-28T16:22:48Z

What does this PR do?

Part of the larger vision model refactor #41693, focused on dinov2, which still has some usage and downloads but mostly serves as a basis for many other models. This is an attempt at bringing it in line with the rest of the library.

HuggingFaceDocBuilderDev · 2026-05-28T16:43:04Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

guarin

Thanks, this will make things much easier! Left more questions than comments :)

guarin · 2026-05-29T07:55:05Z

-            self.mask_token = nn.Parameter(torch.zeros(1, config.hidden_size))
+        self.mask_token = nn.Parameter(torch.zeros(1, config.hidden_size))


Might this break things downstream if now every model expects mask token to exist?

Totally, this is missing the ternary with a None default.

guarin · 2026-05-29T08:10:23Z

-        self.all_head_size = self.num_attention_heads * self.attention_head_size
-        self.dropout_prob = config.attention_probs_dropout_prob
-        self.scaling = self.attention_head_size**-0.5
+        self.head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
+        self.attention_dropout = config.attention_probs_dropout_prob
+        self.scaling = self.head_dim**-0.5


Do we consider attributes as part of the API that shouldn't change? E.g. here the rename from self.dropout_prob to self.attention_dropout could be backwards incompatible

In theory it's OK, in the sense that we can still read old hub configs and make them work with the new code, so it is backwards-compatible in that sense. I'd call it a minor breakage, and it would also let us align more closely with ViT naming.
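
For illustration, the `head_dim` fallback from the diff above can be exercised without loading a model. This is a minimal sketch that uses `types.SimpleNamespace` as a stand-in for a config object; the stand-in configs and the `resolve_head_dim` helper are hypothetical and not part of the PR. It only shows how a config without `head_dim` and one that sets it explicitly both resolve.

```python
from types import SimpleNamespace


def resolve_head_dim(config):
    # Same expression as in the diff: prefer an explicit head_dim,
    # otherwise derive it from hidden_size and num_attention_heads.
    return getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)


# Stand-in for an older hub config that has no head_dim field.
old_config = SimpleNamespace(hidden_size=768, num_attention_heads=12)
# Stand-in for a config that sets head_dim explicitly.
new_config = SimpleNamespace(hidden_size=768, num_attention_heads=12, head_dim=32)

old_head_dim = resolve_head_dim(old_config)
new_head_dim = resolve_head_dim(new_config)
scaling = old_head_dim**-0.5
```

Note that `getattr` evaluates its default eagerly, so `hidden_size` and `num_attention_heads` must exist on the config even when `head_dim` is set.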

guarin · 2026-05-29T08:18:27Z

+        if isinstance(module, Dinov2Embeddings):
+            init.trunc_normal_(module.position_embeddings, mean=0.0, std=self.config.initializer_range)
+            init.trunc_normal_(module.cls_token, mean=0.0, std=self.config.initializer_range)
+            init.zeros_(module.mask_token)


Looks like the custom init is already correctly inherited from ViTPreTrainedModel and we don't have to overwrite it. The if xyz is not None checks will always pass.

guarin · 2026-05-29T08:21:32Z

+        use_mask_token (`bool`, *optional*, defaults to `False`):
+            Whether to use a mask token for masked image modeling.


Remove use_mask_token from the docstring or mention that it is ignored? add_pooling_layer also doesn't seem to be used

well it'll be used in the end

(not pooling_layer though)

guarin · 2026-05-29T08:28:11Z

+        self.pooler = None
+        self.encoder = Dinov2Encoder(config)


This is super minor but it feels off if modules are not declared in the order they are accessed. For example now self.encoder is declared after self.layernorm. This impacts module printing and some torch utils which rely on order of modules. I also don't see self.layers being accessed, where is it needed?

Modules should absolutely be declared in inheritance order! The PR is in draft so I haven't checked yet, but yes. As for self.layers, it is inherited from ViTModel.

guarin · 2026-05-29T08:45:20Z

+        num_patches = self.patch_embeddings.num_patches
+        self.position_embeddings = nn.Parameter(torch.randn(1, num_patches + 1, config.hidden_size))


Inline num_patches?

guarin · 2026-05-29T09:09:33Z

+        if isinstance(module, Dinov2WithRegistersEmbeddings):
            init.trunc_normal_(module.position_embeddings, mean=0.0, std=self.config.initializer_range)


Maybe also not needed

guarin · 2026-05-29T09:12:29Z

-
-        self.num_register_tokens = config.num_register_tokens


Removal of self.num_register_tokens might also break backwards compat

It can, in a minor way, yes. I'm pretty sure this PR needs 🚨 🚨 because we might not be able to avoid some breakage (even if we keep all old attributes).
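
If keeping a removed attribute readable were desired, one common Python pattern is a read-only property that forwards to the new source of truth and emits a deprecation warning. The sketch below is hypothetical and is not what this PR does: the `EmbeddingsSketch` class and the warning text are made up for illustration, and a plain class stands in for the real module.

```python
import warnings
from types import SimpleNamespace


class EmbeddingsSketch:
    """Hypothetical stand-in for an embeddings module; not the PR's code."""

    def __init__(self, config):
        self.config = config

    @property
    def num_register_tokens(self):
        # Forward the removed attribute to the config and warn the caller.
        warnings.warn(
            "num_register_tokens is deprecated here; read config.num_register_tokens instead.",
            FutureWarning,
            stacklevel=2,
        )
        return self.config.num_register_tokens


embeddings = EmbeddingsSketch(SimpleNamespace(num_register_tokens=4))
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = embeddings.num_register_tokens
```

Old call sites keep working while the warning points them at the config field.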

guarin · 2026-05-29T09:14:25Z

    def get_input_embeddings(self):
        return self.embeddings.patch_embeddings

-    @can_return_tuple


Is can_return_tuple not needed anymore?

@capture_outputs(tie_last_hidden_states=False) supersedes it!

guarin · 2026-05-29T09:18:59Z

-        torch.testing.assert_close(predicted_depth[0, :3, :3], expected_slice, rtol=1e-6, atol=1e-6)
+        torch.testing.assert_close(predicted_depth[0, :3, :3], expected_slice, rtol=1e-4, atol=1e-4)


Why is tolerance so much higher?

ah, because it was broken, haha. Noticed that when running on the DGX

E AssertionError: Tensor-likes are not close!
E
E Mismatched elements: 8 / 9 (88.9%)
E Greatest absolute difference: 5.53131103515625e-05 at index (0, 2) (up to 1e-06 allowed)
E Greatest relative difference: 6.415643383661518e-06 at index (0, 2) (up to 1e-06 allowed)

(this is on main). Related because dinov2 is the main backbone
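
To see why the old tolerance fails and the new one passes, the numbers from the error above can be plugged into the closeness criterion that `torch.testing.assert_close` documents, `|actual - expected| <= atol + rtol * |expected|`. This is a rough sketch in plain Python: the `within_tolerance` helper is hypothetical, and the magnitude of the expected value is not reported in the log, so it is derived here under the assumption that the relative difference is the absolute difference divided by `|expected|`.

```python
def within_tolerance(abs_diff, expected_abs, rtol, atol):
    # Closeness criterion: |actual - expected| <= atol + rtol * |expected|
    return abs_diff <= atol + rtol * expected_abs


# Values reported in the assertion error above, at index (0, 2).
greatest_abs_diff = 5.53131103515625e-05
greatest_rel_diff = 6.415643383661518e-06

# Derived, not reported: |expected| = abs_diff / rel_diff (roughly 8.6).
expected_abs = greatest_abs_diff / greatest_rel_diff

passes_old = within_tolerance(greatest_abs_diff, expected_abs, rtol=1e-6, atol=1e-6)
passes_new = within_tolerance(greatest_abs_diff, expected_abs, rtol=1e-4, atol=1e-4)
```

Under these assumptions the old bound allows a difference of about 1e-5 at that element, below the observed 5.5e-5, while the new bound allows about 1e-3.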

github-actions · 2026-06-04T16:16:37Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: depth_anything, dinov2, dinov2_with_registers, dinov3_convnext, dinov3_vit, eomt, eomt_dinov3, pixio, rf_detr, sapiens2, videomt

molbap added 5 commits May 28, 2026 14:40

First draft + todo

94f8c0e

propagate changes, fix tests

c3c3b3a

mask token?

1978ae7

conversion bug

feea8f6

update decorators again, keep Encoders

8743081

guarin reviewed May 29, 2026


molbap mentioned this pull request May 29, 2026

Add Sapiens2 Model #45919

Merged

39 tasks

molbap added 5 commits June 1, 2026 17:24

simplifications

8acf60f

attention mask

2b3b370

registers conversion

e1191d9

Merge branch 'main' into improve_dinos

6a0a80d

fixup merge

b512098

molbap marked this pull request as ready for review June 4, 2026 13:04

molbap changed the title ~~Improve dinos~~ 🚨 🚨Bring some dinos to modern standards Jun 4, 2026

fall back to usual hidden states routing for now

1e29c49

