[Model] Add PP-DocLayoutV3 Model Support #43098
Conversation
ArthurZucker
left a comment
hey! can you provide a bit of context for us? 🤗 a link to the model being released!
Also the code looks very far away from https://huggingface.co/docs/transformers/v4.48.0/modular_transformers which I invite you to read to get a better idea of how to adapt the code!
PP-DocLayoutV3 model_doc and huggingface repo are coming soon. Could I please request a review?
molbap
left a comment
Hello, looking forward to seeing this out! Wrote a first review, don't forget to fill the toctree and add the model documentation as well! Ping me when done and I'll re-review 🤗
inputs_embeds: Optional[torch.FloatTensor] = None,
decoder_inputs_embeds: Optional[torch.FloatTensor] = None,
There was a problem hiding this comment.
seems that both of these are unused, no?
# Lowest resolution feature maps are obtained via 3x3 stride 2 convolutions on the final stage
if self.config.num_feature_levels > len(sources):
    _len_sources = len(sources)
    sources.append(self.decoder_input_proj[_len_sources](encoder_outputs[0])[-1])
There was a problem hiding this comment.
I'm a bit confused here, will that run? isn't encoder_outputs[0] a list?
There was a problem hiding this comment.
This is from the RT-DETR code. I think there’s a typo here. It’s currently written as (encoder_outputs[0])[-1], but I believe it should be (encoder_outputs[0][-1]). Our current configuration doesn’t hit this branch, but it seemed like a good fix to make, so I’ve made the change.
There was a problem hiding this comment.
I see... since this model and v2 have a lot in common with RT-DETR, @yonigozlan is overhauling it over in #41549 if you want to take a look!
There was a problem hiding this comment.
Cool, it seems that at least output_hidden_states and output_attentions can be removed. Once #41549 is merged, PP-DocLayoutV2 and PP-DocLayoutV3 will follow up with these changes.
cxcy, wh = torch.split(boxes, 2, dim=-1)
boxes = torch.cat([cxcy - 0.5 * wh, cxcy + 0.5 * wh], dim=-1)
There was a problem hiding this comment.
full words for variable names, please
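For example, a possible spelled-out version of the snippet above (the names center_coords/box_sizes are only illustrative, not necessarily what the final PR uses):
import torch

# boxes: (batch, num_queries, 4) in (center_x, center_y, width, height) format (dummy data here).
boxes = torch.rand(2, 10, 4)
center_coords, box_sizes = torch.split(boxes, 2, dim=-1)
# Convert to (x_min, y_min, x_max, y_max) corner format.
boxes = torch.cat([center_coords - 0.5 * box_sizes, center_coords + 0.5 * box_sizes], dim=-1)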
B, N, _ = inputs.shape
proj = self.dense(inputs).reshape([B, N, 2, self.head_size])
proj = self.dropout(proj)
qw, kw = proj[..., 0, :], proj[..., 1, :]
There was a problem hiding this comment.
same, no single letter variables
proj = self.dropout(proj)
qw, kw = proj[..., 0, :], proj[..., 1, :]

logits = torch.einsum("bmd,bnd->bmn", qw, kw) / (self.head_size**0.5)  # [B, N, N]
There was a problem hiding this comment.
In 98% of situations we don't want einsum paths in inference code if an alternative exists, for both readability and performance reasons
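For instance, a minimal sketch of the einsum-free equivalent with full-word names (dummy shapes; the actual module would take queries/keys from its projection):
import torch

batch_size, num_queries, head_size = 2, 10, 64
queries = torch.rand(batch_size, num_queries, head_size)
keys = torch.rand(batch_size, num_queries, head_size)

# Equivalent to torch.einsum("bmd,bnd->bmn", queries, keys), written as a plain batched matmul.
logits = torch.matmul(queries, keys.transpose(-2, -1)) / (head_size**0.5)  # [batch, num_queries, num_queries]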
def forward(self, x):
    # use 'x * F.sigmoid(x)' replace 'silu'
    x = self.bn(self.conv(x))
    y = x * F.sigmoid(x)
    return y
There was a problem hiding this comment.
The standard interface looks more like this (example from RTDetr which you can use directly in modular, actually):
class RTDetrResNetConvLayer(nn.Module):
    def __init__(
        self, in_channels: int, out_channels: int, kernel_size: int = 3, stride: int = 1, activation: str = "relu"
    ):
        super().__init__()
        self.convolution = nn.Conv2d(
            in_channels, out_channels, kernel_size=kernel_size, stride=stride, padding=kernel_size // 2, bias=False
        )
        self.normalization = nn.BatchNorm2d(out_channels)
        self.activation = ACT2FN[activation] if activation is not None else nn.Identity()

    def forward(self, input: Tensor) -> Tensor:
        hidden_state = self.convolution(input)
        hidden_state = self.normalization(hidden_state)
        hidden_state = self.activation(hidden_state)
        return hidden_state

and you can tune the activation function you want (sigmoid here) based on config.
def get_order_seqs(order_logits):
    order_scores = torch.sigmoid(order_logits)
    batch_size, sequence_length, _ = order_scores.shape

    one = torch.ones((sequence_length, sequence_length), dtype=order_scores.dtype, device=order_scores.device)
    upper = torch.triu(one, 1)
    lower = torch.tril(one, -1)

    Q = order_scores * upper + (1.0 - order_scores.transpose(1, 2)) * lower
    order_votes = Q.sum(dim=1)

    order_pointers = torch.argsort(order_votes, dim=1)
    order_seq = torch.full_like(order_pointers, -1)
    batch = torch.arange(batch_size, device=order_pointers.device)[:, None]
    order_seq[batch, order_pointers] = torch.arange(sequence_length, device=order_pointers.device)[None, :]

    return order_seq
There was a problem hiding this comment.
a few things here that got me thinking (this should likely be a private method). Instead of the transposition you could use tril.
Then if I understand this correctly, I don't think we need to materialize Q, you could just do something like this?
order_votes = (
    order_scores.triu(diagonal=1).sum(dim=1)
    + (1.0 - order_scores.transpose(1, 2)).tril(diagonal=-1).sum(dim=1)
)
and the permutation after argsort can be something like
order_seq = torch.empty_like(order_pointers)
ranks = torch.arange(sequence_length, device=order_pointers.device, dtype=order_pointers.dtype).expand(batch_size, -1)
order_seq.scatter_(1, order_pointers, ranks)
There was a problem hiding this comment.
excellent suggestion, thank you. I’ve made the change.
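For later readers, a self-contained sketch of the private method with both suggestions folded in (variable names assumed; the merged code may differ in details):
import torch

def _get_order_seqs(order_logits: torch.Tensor) -> torch.Tensor:
    # Pairwise "comes before" scores between the N queries of each batch item.
    order_scores = torch.sigmoid(order_logits)
    batch_size, sequence_length, _ = order_scores.shape

    # Vote for each element's reading-order position without materializing the full Q matrix.
    order_votes = (
        order_scores.triu(diagonal=1).sum(dim=1)
        + (1.0 - order_scores.transpose(1, 2)).tril(diagonal=-1).sum(dim=1)
    )

    # Convert the ranking induced by the votes into a permutation.
    order_pointers = torch.argsort(order_votes, dim=1)
    ranks = torch.arange(
        sequence_length, device=order_pointers.device, dtype=order_pointers.dtype
    ).expand(batch_size, -1)
    order_seq = torch.empty_like(order_pointers)
    order_seq.scatter_(1, order_pointers, ranks)
    return order_seq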
@molbap
molbap
left a comment
Added another review, now let's see with the DETR refactor changes that will be merged very soon!
## Overview

TBD.
There was a problem hiding this comment.
Will need a small abstract, and a code usage snippet
There was a problem hiding this comment.
Sure, but that may be later, because the model has not been released yet.
There was a problem hiding this comment.
current usage snippets are good!
@@ -0,0 +1,47 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
There was a problem hiding this comment.
Let's update when relevant
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
<!--Copyright 2026 The HuggingFace Team. All rights reserved.
self.resample = resample

def _get_order_seqs(self, order_logits):
    order_scores = torch.sigmoid(order_logits)
There was a problem hiding this comment.
much better. Let's add a small docstring to explain what that does for later code inspectors. However, I just noticed that the ImageProcessor (not Fast) was using torch + torchvision; the idea is that the fast image processor uses torch/torchvision and is as GPU-compatible as possible, while the "slow" image processor is CPU-bound, using PIL and numpy operations only.
So the Fast image processor (which should be the default, especially if it matches the processing) is ok, but if you want to include a "slow" one, it should be without torch/torchvision
There was a problem hiding this comment.
Hmm, order_logits is a tensor, so can I refer to RT-DETR and use requires_backends(self, ["torch"]) to solve this problem?
There was a problem hiding this comment.
ah yes indeed, you do need to do some torch operations. In this case, I suggest simply not having the non-fast processor; it doesn't make much sense to have it here. In the mapping, it will be (None, ...ImageProcessorFast)
There was a problem hiding this comment.
well, so, should I remove PPDocLayoutV3ImageProcessor and only keep PPDocLayoutV3ImageProcessorFast?
There was a problem hiding this comment.
See my comment in auto; if there is no strict reason to keep a slow processor, we can IMO remove the slow one
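For illustration, this is roughly what the entry could look like in the auto image processor mapping, assuming the usual (slow, fast) tuple layout of IMAGE_PROCESSOR_MAPPING_NAMES in image_processing_auto.py (mentioned here for orientation only):
from collections import OrderedDict

# Sketch of the auto-mapping entry with no slow (PIL/numpy) processor registered.
IMAGE_PROCESSOR_MAPPING_NAMES = OrderedDict(
    [
        # ... other models ...
        ("pp_doclayout_v3", (None, "PPDocLayoutV3ImageProcessorFast")),
    ]
)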
logits = (queries @ keys.transpose(-2, -1)) / (self.head_size**0.5)
lower = torch.tril(torch.ones([sequence_length, sequence_length], dtype=logits.dtype, device=logits.device))
logits = logits - lower.unsqueeze(0) * 1e4
There was a problem hiding this comment.
Is this 1e4 value hardcoded? It would be nice to name it and have it in the configuration
There was a problem hiding this comment.
We'll apply sigmoid later, so the masked triangular entries just need to be pushed to a very large negative value; no additional configuration is needed.
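For reference, a tiny self-contained check of why a fixed offset is enough here (illustrative values only):
import torch

sequence_length = 4
logits = torch.zeros(1, sequence_length, sequence_length)
lower = torch.tril(torch.ones(sequence_length, sequence_length))

# Subtracting a large constant drives the masked entries to ~0 after sigmoid;
# the exact magnitude (1e4 or anything larger) does not change the result.
masked_scores = torch.sigmoid(logits - lower.unsqueeze(0) * 1e4)
print(masked_scores[0])  # masked entries ~0, the rest stay at sigmoid(0) = 0.5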
for key, tensor in inputs_dict.items():
    if tensor.dtype == torch.float32:
        inputs_dict[key] = tensor.to(dtype)
There was a problem hiding this comment.
why are we converting to the wanted dtype only for float32 tensors?
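If the intent is to cast every floating-point input while leaving integer tensors (labels, masks) untouched, a broader check could be sketched like this (the helper name is made up for illustration):
import torch

def cast_floating_inputs(inputs_dict: dict, dtype: torch.dtype) -> dict:
    # Cast all floating-point tensors (float32, float16, bfloat16, ...) to the target dtype;
    # integer tensors such as labels or pixel masks are left as they are.
    return {
        key: tensor.to(dtype) if torch.is_floating_point(tensor) else tensor
        for key, tensor in inputs_dict.items()
    }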
# TODO:
# @require_torch
# @require_vision
# @slow
# class PPDocLayoutV3ModelIntegrationTest(unittest.TestCase):
There was a problem hiding this comment.
TODO I suppose! The best option is to take an image and the output you expect; you can put the image on a hub dataset and request it in the test. Many tests are built like that - it's the most reliable way
There was a problem hiding this comment.
same as model_doc
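Such a test could be sketched roughly like this once weights are public; the checkpoint id, fixture path, and expected values below are placeholders, and the auto classes are assumed to be wired up for this model:
import unittest

import torch

from transformers import AutoImageProcessor, AutoModelForObjectDetection
from transformers.testing_utils import require_torch, require_vision, slow
from transformers.utils import is_vision_available

if is_vision_available():
    from PIL import Image


@require_torch
@require_vision
@slow
class PPDocLayoutV3ModelIntegrationTest(unittest.TestCase):
    def test_inference(self):
        # Placeholder repo id until the weights are actually released.
        checkpoint = "PaddlePaddle/PP-DocLayoutV3"
        model = AutoModelForObjectDetection.from_pretrained(checkpoint)
        processor = AutoImageProcessor.from_pretrained(checkpoint)

        # Ideally the image lives in a hub dataset so the test is reproducible everywhere.
        image = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
        inputs = processor(images=image, return_tensors="pt")

        with torch.no_grad():
            outputs = model(**inputs)

        # Compare a small slice of logits against values recorded from a reference run (placeholders here).
        expected_logits_slice = torch.tensor([[-4.64, -4.91, -4.57]])
        torch.testing.assert_close(outputs.logits[0, :1, :3], expected_logits_slice, atol=1e-3, rtol=1e-3)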
@@ -0,0 +1,446 @@
# coding = utf-8
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
There was a problem hiding this comment.
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
if is_vision_available():
    pass
return logits


class PPDocLayoutV3MultiscaleDeformableAttention(RTDetrMultiscaleDeformableAttention):
There was a problem hiding this comment.
could this inherit from rt detr v2 rather than v1?
There was a problem hiding this comment.
What is the reason for this change? Inheriting this MultiscaleDeformableAttention from rt_detr_v2 does not work, because our model's structure is like rt_detr rather than rt_detr_v2.
There was a problem hiding this comment.
it was just out of curiosity as v2 is a bit more "modern". In any case for now it seems good, we will need #41549 to be merged and then we can merge this!
There was a problem hiding this comment.
as explained, in the end no need for merging this
@molbap
vasqu
left a comment
We really need the rt detr refactor to make this aligned with our modern code base cc @yonigozlan to keep this model in mind
In general, I've added a lot of comments to what ideally it should be. I know that some stuff cannot be changed due to how we depend on rt detr
rendered properly in your Markdown viewer.

-->
*This model was released on 2026-x-x and added to Hugging Face Transformers on 2026-x-x.*
There was a problem hiding this comment.
Just FYI, we changed some CI rules so the CI might complain; you can add arbitrary values first if needed (for release)
There was a problem hiding this comment.
Okay, let me fill it in first. I’ll come back to modify it later when merging.
@vasqu I have modified the code according to your suggestions. PTAL.
molbap
left a comment
Hey, just to clarify @zhang-prog: after talking internally with @vasqu and @yonigozlan, we will merge as-is in the end, not waiting for the DETR refactor to be done, so that the code can be up as soon as possible and we'll refactor later. So once all comments are addressed it should be fine. Let's make sure it's as polished as possible before the finish line!
@auto_docstring
class PPDocLayoutV3PreTrainedModel(PreTrainedModel):
init.zeros_(module.weight.data[module.padding_idx])


def mask_to_box_coordinate(mask, dtype):
There was a problem hiding this comment.
for later (cc @yonigozlan) I think we can extract a shared utility into image_utils around methods like this one
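For reference, a rough sketch of what such a shared utility could look like; the signature mirrors the mask_to_box_coordinate(mask, dtype) name in the diff, but the exact semantics are assumed rather than copied from the PR:
import torch


def mask_to_box_coordinate(mask: torch.Tensor, dtype: torch.dtype = torch.float32) -> torch.Tensor:
    # Turn binary masks of shape (..., height, width) into (x_min, y_min, x_max, y_max) boxes;
    # an all-zero mask yields a degenerate box and would need special handling in real code.
    mask = mask.to(torch.bool)
    height, width = mask.shape[-2:]

    y_grid = torch.arange(height, dtype=dtype, device=mask.device)
    x_grid = torch.arange(width, dtype=dtype, device=mask.device)

    # Broadcast coordinate grids over the mask; background pixels become 0.
    x_mask = mask * x_grid.view(1, -1)
    y_mask = mask * y_grid.view(-1, 1)

    x_max = x_mask.flatten(-2).max(dim=-1).values
    y_max = y_mask.flatten(-2).max(dim=-1).values

    # For the minima, push background pixels to a large value before taking the min.
    x_min = x_mask.masked_fill(~mask, width).flatten(-2).min(dim=-1).values
    y_min = y_mask.masked_fill(~mask, height).flatten(-2).min(dim=-1).values

    return torch.stack([x_min, y_min, x_max, y_max], dim=-1)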
vasqu
left a comment
Happy with the current state, mainly nits left re abbreviations, and possibly splitting the config if bandwidth allows.
@dataclass
@auto_docstring
class PPDocLayoutV3ForObjectDetectionOutput(ModelOutput):
There was a problem hiding this comment.
Could we inherit from something? I'm also seeing some of the output types inheriting already - maybe we can reduce some of the repetition then; not sure if autodoc handles this properly. cc @yonigozlan
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
molbap
left a comment
Good for me too! Two nits related to the model name in the doc, and an import thing
CI Results: ✅ No failing test specific to this PR 🎉!
@zhang-prog Are we waiting for the weights or will they be released after merge? Re slow integration tests
@vasqu We will get back to you after we have an internal discussion.
No worries, you can also write us on Slack then 🤗
@molbap hi, Pablo, we have made three changes, and this should be the final update:
Ready for review, PTAL!
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=43098&sha=cbbc37
molbap
left a comment
Mostly naming-related comments plus one important thing: prefixes to be fixed. Thank you!
| "num_attention_heads": "encoder_attention_heads", | ||
| } | ||
|
|
||
| def __init__( |
_no_split_modules = [r"PPDocLayoutV3HybridEncoder", r"PPDocLayoutV3DecoderLayer"]


@torch.no_grad()
def _init_weights(self, module):
vasqu
left a comment
A lot of smaller things; one thing I'm unsure about is whether the model should tie weights - at least it looks like that to me
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, pp_doclayout_v3
run-slow: pp_doclayout_v3
This comment contains models: ["models/pp_doclayout_v3"]
CI Results: ✅ No failing test specific to this PR 🎉!
Merging, CI is on fire these days and this PR only touches the new model, which passes all tests - thanks a lot @zhang-prog

No description provided.