🚨 🚨Bring some dinos to modern standards#46266
Conversation
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
guarin
left a comment
There was a problem hiding this comment.
Thanks, this will make things much easier! Left more questions than comments :)
| self.mask_token = nn.Parameter(torch.zeros(1, config.hidden_size)) | ||
| self.mask_token = nn.Parameter(torch.zeros(1, config.hidden_size)) |
There was a problem hiding this comment.
Might this break things downstream if now every model expects mask token to exist?
There was a problem hiding this comment.
totally, missing ternary with None default
| self.all_head_size = self.num_attention_heads * self.attention_head_size | ||
| self.dropout_prob = config.attention_probs_dropout_prob | ||
| self.scaling = self.attention_head_size**-0.5 | ||
| self.head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads) | ||
| self.attention_dropout = config.attention_probs_dropout_prob | ||
| self.scaling = self.head_dim**-0.5 |
There was a problem hiding this comment.
Do we consider attributes as part of the API that shouldn't change? E.g. here the rename from self.dropout_prob to self.attention_dropout could be backwards incompatible
There was a problem hiding this comment.
in theory it's OK, in the sense that we can still read old hub configs and make them work with the new code -it is backwards-compatible in that sense. So it's a minor breakage I'd say, also would allow to be more aligned with ViT naming-wise.
| if isinstance(module, Dinov2Embeddings): | ||
| init.trunc_normal_(module.position_embeddings, mean=0.0, std=self.config.initializer_range) | ||
| init.trunc_normal_(module.cls_token, mean=0.0, std=self.config.initializer_range) | ||
| init.zeros_(module.mask_token) |
There was a problem hiding this comment.
Looks like the custom init is already correctly inherited from ViTPreTrainedModel and we don't have to overwrite it. The if xyz is not None checks will always pass.
| use_mask_token (`bool`, *optional*, defaults to `False`): | ||
| Whether to use a mask token for masked image modeling. |
There was a problem hiding this comment.
Remove use_mask_token from the docstring or mention that it is ignored? add_pooling_layer also doesn't seem to be used
There was a problem hiding this comment.
well it'll be used in the end
There was a problem hiding this comment.
(not pooling_layer though)
| self.pooler = None | ||
| self.encoder = Dinov2Encoder(config) |
There was a problem hiding this comment.
This is super minor but it feels off if modules are not declared in the order they are accessed. For example now self.encoder is declared after self.layernorm. This impacts module printing and some torch utils which rely on order of modules. I also don't see self.layers being accessed, where is it needed?
There was a problem hiding this comment.
Modules should absolutely be declared in inheritance order! PR is in draft so haven't checked yet, but yes. For self.layers it's an inheritance from VitModel
| num_patches = self.patch_embeddings.num_patches | ||
| self.position_embeddings = nn.Parameter(torch.randn(1, num_patches + 1, config.hidden_size)) |
| if isinstance(module, Dinov2WithRegistersEmbeddings): | ||
| init.trunc_normal_(module.position_embeddings, mean=0.0, std=self.config.initializer_range) |
|
|
||
| self.num_register_tokens = config.num_register_tokens |
There was a problem hiding this comment.
Removal of self.num_register_tokens might also break backwards compat
There was a problem hiding this comment.
that can minor-ly, yes, I'm pretty sure this PR needs 🚨 🚨 because we might not be able to get around some breakage (even if we keep all old attributes)
| def get_input_embeddings(self): | ||
| return self.embeddings.patch_embeddings | ||
|
|
||
| @can_return_tuple |
There was a problem hiding this comment.
Is can_return_tuple not needed anymore?
There was a problem hiding this comment.
@capture_outputs(tie_last_hidden_states=False) supersedes it!
| torch.testing.assert_close(predicted_depth[0, :3, :3], expected_slice, rtol=1e-6, atol=1e-6) | ||
| torch.testing.assert_close(predicted_depth[0, :3, :3], expected_slice, rtol=1e-4, atol=1e-4) |
There was a problem hiding this comment.
Why is tolerance so much higher?
There was a problem hiding this comment.
ah, because it was broken, haha. Noticed that when running on the DGX
E AssertionError: Tensor-likes are not close!
E
E Mismatched elements: 8 / 9 (88.9%)
E Greatest absolute difference: 5.53131103515625e-05 at index (0, 2) (up to 1e-06 allowed)
E Greatest relative difference: 6.415643383661518e-06 at index (0, 2) (up to 1e-06 allowed)(this is on main). Related because dinov2 is the main backbone
|
[For maintainers] Suggested jobs to run (before merge) run-slow: depth_anything, dinov2, dinov2_with_registers, dinov3_convnext, dinov3_vit, eomt, eomt_dinov3, pixio, rf_detr, sapiens2, videomt |
What does this PR do?
Part of the larger vision model refactor #41693 focused on dinov2, which has still some usage and downloads, but mostly serves as a basis for many other models. Attempt at putting this in line with the rest of the lib.