
[Model] Add PP-DocLayoutV3 Model Support#43098

Merged
vasqu merged 24 commits into huggingface:main from zhang-prog:feat/pp_doclayout_v3
Jan 29, 2026

Conversation

@zhang-prog
Contributor

No description provided.

Collaborator

@ArthurZucker left a comment

hey! can you provide a bit of context for us? 🤗 a link to the model being released!
Also the code looks very far away from https://huggingface.co/docs/transformers/v4.48.0/modular_transformers, which I invite you to read to get a better idea of how to adapt the code!

@zhang-prog
Contributor Author

@ArthurZucker

  1. The model is not yet publicly released, but we aim to merge it into the transformers library as soon as possible.
  2. I’ve adopted a modular approach and further optimized the code.

The PP-DocLayoutV3 model_doc and Hugging Face repo are coming soon. Could I please request a review?

@molbap self-assigned this Jan 12, 2026
Contributor

@molbap left a comment

Hello, looking forward to seeing this out! I wrote a first review; don't forget to fill the toctree and add the model documentation as well! Ping me when done and I'll re-review 🤗

Comment on lines 1375 to 1376
inputs_embeds: Optional[torch.FloatTensor] = None,
decoder_inputs_embeds: Optional[torch.FloatTensor] = None,
Contributor

seems that both of these are unused, no?

# Lowest resolution feature maps are obtained via 3x3 stride 2 convolutions on the final stage
if self.config.num_feature_levels > len(sources):
_len_sources = len(sources)
sources.append(self.decoder_input_proj[_len_sources](encoder_outputs[0])[-1])
Contributor

I'm a bit confused here, will that run? isn't encoder_outputs[0] a list?

Contributor Author

This is from the RT-DETR code, and I think there's a typo here: it's currently written as (encoder_outputs[0])[-1], but I believe it should be (encoder_outputs[0][-1]). Our current configuration doesn't hit this branch, but it seemed like a good fix to make, so I've made the change.

Contributor

I see... since this model and v2 have a lot in common with RT-DETR, @yonigozlan is overhauling it over there in #41549, if you want to take a look!

Contributor Author

@zhang-prog Jan 13, 2026

Cool, it seems that at least output_hidden_states and output_attentions can be removed. Once #41549 is merged, PP-DocLayoutV2 and PP-DocLayoutV3 will follow up with these changes.

Comment on lines 693 to 694
cxcy, wh = torch.split(boxes, 2, dim=-1)
boxes = torch.cat([cxcy - 0.5 * wh, cxcy + 0.5 * wh], dim=-1)
Contributor

full words for variable names, please

Contributor Author

Renamed
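
For reference, the renamed conversion might look roughly like this (a sketch with full variable names; the exact names that landed may differ):

import torch

def center_to_corners(boxes: torch.Tensor) -> torch.Tensor:
    # boxes: (..., 4) in (center_x, center_y, width, height) format
    center, size = torch.split(boxes, 2, dim=-1)
    # corners: (x_min, y_min, x_max, y_max)
    return torch.cat([center - 0.5 * size, center + 0.5 * size], dim=-1)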

Comment on lines 742 to 745
B, N, _ = inputs.shape
proj = self.dense(inputs).reshape([B, N, 2, self.head_size])
proj = self.dropout(proj)
qw, kw = proj[..., 0, :], proj[..., 1, :]
Contributor

same, no single-letter variables

proj = self.dropout(proj)
qw, kw = proj[..., 0, :], proj[..., 1, :]

logits = torch.einsum("bmd,bnd->bmn", qw, kw) / (self.head_size**0.5) # [B, N, N]
Contributor

In 98% of situations we don't want einsum paths in inference code if an alternative exists, for both readability and performance reasons

Contributor Author

Done
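
For the record, the einsum maps to a plain batched matrix multiply; a minimal sketch of the equivalence, with queries and keys shaped (batch, seq, head_size):

import torch

def pairwise_logits(queries: torch.Tensor, keys: torch.Tensor, head_size: int) -> torch.Tensor:
    # equivalent to torch.einsum("bmd,bnd->bmn", queries, keys) / head_size**0.5
    return (queries @ keys.transpose(-2, -1)) / (head_size**0.5)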

Comment on lines 963 to 967
def forward(self, x):
# use 'x * F.sigmoid(x)' replace 'silu'
x = self.bn(self.conv(x))
y = x * F.sigmoid(x)
return y
Contributor

The standard interface looks more like this (example from RTDetr, which you can actually use directly in modular):

class RTDetrResNetConvLayer(nn.Module):
    def __init__(
        self, in_channels: int, out_channels: int, kernel_size: int = 3, stride: int = 1, activation: str = "relu"
    ):
        super().__init__()
        self.convolution = nn.Conv2d(
            in_channels, out_channels, kernel_size=kernel_size, stride=stride, padding=kernel_size // 2, bias=False
        )
        self.normalization = nn.BatchNorm2d(out_channels)
        self.activation = ACT2FN[activation] if activation is not None else nn.Identity()

    def forward(self, input: Tensor) -> Tensor:
        hidden_state = self.convolution(input)
        hidden_state = self.normalization(hidden_state)
        hidden_state = self.activation(hidden_state)
        return hidden_state

and you can tune the activation function you want (sigmoid here) based on config.
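
Worth noting that x * sigmoid(x) is exactly SiLU, which ACT2FN exposes as "silu", so the block above could presumably be reused as-is (channel sizes below are placeholders):

# hypothetical instantiation of the RTDetrResNetConvLayer shown above
layer = RTDetrResNetConvLayer(in_channels=256, out_channels=256, kernel_size=3, stride=1, activation="silu")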

Comment on lines 32 to 48
def get_order_seqs(order_logits):
order_scores = torch.sigmoid(order_logits)
batch_size, sequence_length, _ = order_scores.shape

one = torch.ones((sequence_length, sequence_length), dtype=order_scores.dtype, device=order_scores.device)
upper = torch.triu(one, 1)
lower = torch.tril(one, -1)

Q = order_scores * upper + (1.0 - order_scores.transpose(1, 2)) * lower
order_votes = Q.sum(dim=1)

order_pointers = torch.argsort(order_votes, dim=1)
order_seq = torch.full_like(order_pointers, -1)
batch = torch.arange(batch_size, device=order_pointers.device)[:, None]
order_seq[batch, order_pointers] = torch.arange(sequence_length, device=order_pointers.device)[None, :]

return order_seq
Contributor

A few things here got me thinking (this should likely be a private method). Instead of the transposition you could use tril.
Then, if I understand this correctly, I don't think we need to materialize Q; you could just do something like this:

    order_votes = (
        order_scores.triu(diagonal=1).sum(dim=1)
        + (1.0 - order_scores.transpose(1, 2)).tril(diagonal=-1).sum(dim=1)
    )

and the permutation after argsort can be something like

    order_seq = torch.empty_like(order_pointers)
    ranks = torch.arange(sequence_length, device=order_pointers.device, dtype=order_pointers.dtype).expand(batch_size, -1)
    order_seq.scatter_(1, order_pointers, ranks)

Contributor Author

Excellent suggestion, thank you. I've made the change.
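
Putting the two fragments above together, the private method might end up looking roughly like this (a sketch, not necessarily the exact code that landed):

import torch

def _get_order_seqs(order_logits: torch.Tensor) -> torch.Tensor:
    # order_logits: (batch, seq, seq) pairwise "element i reads before element j" logits
    order_scores = torch.sigmoid(order_logits)
    batch_size, sequence_length, _ = order_scores.shape

    # vote without materializing Q: strictly-upper entries vote directly,
    # strictly-lower entries vote via 1 - the transposed score
    order_votes = (
        order_scores.triu(diagonal=1).sum(dim=1)
        + (1.0 - order_scores.transpose(1, 2)).tril(diagonal=-1).sum(dim=1)
    )

    # argsort gives the reading order; the scatter inverts that permutation
    order_pointers = torch.argsort(order_votes, dim=1)
    order_seq = torch.empty_like(order_pointers)
    ranks = torch.arange(
        sequence_length, device=order_pointers.device, dtype=order_pointers.dtype
    ).expand(batch_size, -1)
    order_seq.scatter_(1, order_pointers, ranks)
    return order_seq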

@zhang-prog
Contributor Author

@molbap
Thank you for your effort! I’ve made changes based on your comments.
PTAL.

@zhang-prog requested a review from molbap January 13, 2026 12:37
Contributor

@molbap left a comment

Added another review; now let's see about the DETR refactor changes that will be merged very soon!


## Overview

TBD.
Contributor

Will need a small abstract, and a code usage snippet

Contributor Author

Sure, but that may have to come later, because the model has not been released yet.

Contributor

current usage snippets are good!

@@ -0,0 +1,47 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Contributor

Let's update when relevant

Suggested change
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
<!--Copyright 2026 The HuggingFace Team. All rights reserved.

self.resample = resample

def _get_order_seqs(self, order_logits):
order_scores = torch.sigmoid(order_logits)
Contributor

Much better. Let's add a small docstring to explain what this does for later code inspectors. However, I just noticed that the ImageProcessor (not Fast) was using torch + torchvision. The idea is that the fast image processor uses torch/torchvision and is as GPU-compatible as possible, while the "slow" image processor is CPU-bound, using only PIL and numpy operations.

So the Fast image processor (which should be the default, especially if it matches the processing) is OK, but if you want to include a "slow" one, it should be without torch/torchvision.

Contributor Author

Hmm, order_logits is a tensor, so can I follow RT-DETR and use requires_backends(self, ["torch"]) to solve this problem?

Contributor

Ah yes, indeed, you do need to do some torch operations. In this case, I suggest simply not having the non-fast processor; it doesn't make much sense to have it here. In the mapping, it will be (None, ...ImageProcessorFast)
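
For context, the auto-mapping entry without a slow processor would hypothetically look like this (entry shape only; the key is illustrative):

# inside IMAGE_PROCESSOR_MAPPING_NAMES, slow slot left empty:
("pp_doclayout_v3", (None, "PPDocLayoutV3ImageProcessorFast")),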

Contributor Author

Well, so should I remove PPDocLayoutV3ImageProcessor and only keep PPDocLayoutV3ImageProcessorFast?

Contributor

See my comment in auto; if there is no strict reason to keep a slow processor, we can IMO remove the slow one.


logits = (queries @ keys.transpose(-2, -1)) / (self.head_size**0.5)
lower = torch.tril(torch.ones([sequence_length, sequence_length], dtype=logits.dtype, device=logits.device))
logits = logits - lower.unsqueeze(0) * 1e4
Contributor

Is this 1e4 value hardcoded? It would be nice to name it and have it in the configuration.

Contributor Author

We apply a sigmoid afterwards, so the masked triangle just needs a very large negative offset that saturates to zero after the sigmoid; no additional configuration is needed.
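
To spell out the reasoning: the sigmoid saturates for large negative inputs, so 1e4 acts as a mask value rather than a tunable hyperparameter:

import torch

# a score pushed down by 1e4 vanishes after the sigmoid
print(torch.sigmoid(torch.tensor(-1e4)))  # tensor(0.)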

Comment on lines 281 to 283
for key, tensor in inputs_dict.items():
if tensor.dtype == torch.float32:
inputs_dict[key] = tensor.to(dtype)
Contributor

why are we converting to the wanted dtype only for float32 tensors?

Contributor Author

Removed

Comment on lines 442 to 446
# TODO:
# @require_torch
# @require_vision
# @slow
# class PPDocLayoutV3ModelIntegrationTest(unittest.TestCase):
Contributor

TODO I suppose! The best option is to take an image and the output you expect; you can put the image on a hub dataset and fetch it in the test. Many tests are built like that - it's the most reliable way.
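
A skeleton following that pattern might look like the sketch below; the checkpoint id, dataset URL, and expected values are all placeholders, not real artifacts:

import unittest

import requests
import torch
from PIL import Image

from transformers import AutoImageProcessor, AutoModelForObjectDetection
from transformers.testing_utils import require_torch, require_vision, slow


@require_torch
@require_vision
@slow
class PPDocLayoutV3ModelIntegrationTest(unittest.TestCase):
    def test_inference(self):
        checkpoint = "org/pp-doclayout-v3"  # placeholder repo id
        processor = AutoImageProcessor.from_pretrained(checkpoint)
        model = AutoModelForObjectDetection.from_pretrained(checkpoint)

        # fixed test image hosted on a hub dataset (placeholder URL)
        url = "https://huggingface.co/datasets/org/test-images/resolve/main/page.png"
        image = Image.open(requests.get(url, stream=True).raw)

        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)

        # compare a slice of the logits against precomputed expected values
        expected_logits = torch.tensor([[-4.5, -4.2, -3.9]])  # placeholder numbers
        torch.testing.assert_close(outputs.logits[0, :1, :3], expected_logits, rtol=2e-4, atol=2e-4)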

Contributor Author

same as model_doc

@@ -0,0 +1,446 @@
# coding = utf-8
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
Contributor

Suggested change
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.

Contributor Author

Done

Contributor

Bumping

Comment on lines 44 to 45
if is_vision_available():
pass
Contributor

Suggested change
if is_vision_available():
pass

Contributor Author

Done

return logits


class PPDocLayoutV3MultiscaleDeformableAttention(RTDetrMultiscaleDeformableAttention):
Contributor

could this inherit from rt detr v2 rather than v1?

Contributor Author

What is the reason for this change? Having this MultiscaleDeformableAttention inherit from rt_detr_v2 doesn't work, because our model's structure follows rt_detr rather than rt_detr_v2.

Contributor

It was just out of curiosity, as v2 is a bit more "modern". In any case it seems good for now; we will need #41549 to be merged and then we can merge this!

Contributor

as explained, in the end no need for merging this

@zhang-prog
Contributor Author

@molbap
Model overview and PPDocLayoutV3ModelIntegrationTest cannot be completed now, but they are coming soon.
PTAL.

@zhang-prog requested a review from molbap January 14, 2026 10:53
Contributor

@vasqu left a comment

We really need the rt detr refactor to make this aligned with our modern code base - cc @yonigozlan to keep this model in mind.

In general, I've added a lot of comments on what it should ideally be. I know that some stuff cannot be changed due to how we depend on rt detr.

rendered properly in your Markdown viewer.

-->
*This model was released on 2026-x-x and added to Hugging Face Transformers on 2026-x-x.*
Contributor

Just FYI, we changed some CI rules, so it might complain; you can add arbitrary values first if needed (for release).

Contributor Author

Okay, let me fill it in first; I'll come back and update it later when merging.


@zhang-prog
Contributor Author

@vasqu I have modified the code according to your suggestions. PTAL.

Contributor

@molbap left a comment

Hey, just to clarify @zhang-prog: after internal discussion with @vasqu and @yonigozlan, we will merge as-is in the end, not waiting for the DETR refactor to be done. That way the code can be up as soon as possible, and we'll refactor later - so once all comments are addressed it should be fine. Let's make sure it's as polished as possible before the finish line!






@auto_docstring
class PPDocLayoutV3PreTrainedModel(PreTrainedModel):
Contributor

Suggestion on that: why not RTDetrPreTrainedModel? From my analysis it is close enough; just one initialization might need to be overridden:

[screenshot: the one initialization that might need overriding]
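
In modular terms the suggestion might look roughly like this (a sketch; the overridden branch mirrors the padding-idx initialization shown in the hunk below):

import torch
from torch import nn

from transformers.models.rt_detr.modeling_rt_detr import RTDetrPreTrainedModel


class PPDocLayoutV3PreTrainedModel(RTDetrPreTrainedModel):
    @torch.no_grad()
    def _init_weights(self, module):
        super()._init_weights(module)
        # the one extra initialization: zero out the padding embedding row
        if isinstance(module, nn.Embedding) and module.padding_idx is not None:
            nn.init.zeros_(module.weight.data[module.padding_idx])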

init.zeros_(module.weight.data[module.padding_idx])


def mask_to_box_coordinate(mask, dtype):
Contributor

For later (cc @yonigozlan): I think we can extract a shared utility into image_utils for methods like this one.
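
For illustration, a generic version of such a utility might look like this (a sketch of a mask-to-box conversion, not the PR's exact code; empty masks are not handled):

import torch

def mask_to_box_coordinate(mask: torch.Tensor, dtype: torch.dtype = torch.float32) -> torch.Tensor:
    # mask: (batch, num_queries, height, width) binary instance masks
    # returns: (batch, num_queries, 4) boxes as (x_min, y_min, x_max, y_max)
    mask = mask.bool()
    _, _, height, width = mask.shape
    y_coords = torch.arange(height, dtype=dtype, device=mask.device).view(1, 1, height, 1)
    x_coords = torch.arange(width, dtype=dtype, device=mask.device).view(1, 1, 1, width)

    # max over the masked coordinates; for the min, fill unmasked positions
    # with a large value so they never win
    x_max = (mask * x_coords).flatten(-2).amax(dim=-1)
    y_max = (mask * y_coords).flatten(-2).amax(dim=-1)
    big = torch.finfo(dtype).max
    x_min = torch.where(mask, x_coords.expand_as(mask), big).flatten(-2).amin(dim=-1)
    y_min = torch.where(mask, y_coords.expand_as(mask), big).flatten(-2).amin(dim=-1)
    return torch.stack([x_min, y_min, x_max, y_max], dim=-1)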

Contributor

@vasqu left a comment

Happy with the current state; mainly nits left re abbreviations, and possibly splitting the config if bandwidth allows.

@@ -0,0 +1,446 @@
# coding = utf-8
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
Contributor

Bumping


@dataclass
@auto_docstring
class PPDocLayoutV3ForObjectDetectionOutput(ModelOutput):
Contributor

Could we inherit from something? I'm also seeing some of the output types inheriting already - maybe we can reduce some repetition then; not sure if autodoc handles this properly cc @yonigozlan

@zhang-prog
Contributor Author

@molbap @vasqu I’ve updated the code based on your suggestions. Let’s take a look at the latest version!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Contributor

@molbap left a comment

Good for me too! Two nits related to the model name in the doc, and an import thing.

@github-actions
Contributor

CI Results

Workflow Run ⚙️

✅ No failing test specific to this PR 🎉 !

@vasqu
Contributor

vasqu commented Jan 22, 2026

@zhang-prog Are we waiting for the weights or will they be released after merge? Re slow integration tests

@zhang-prog
Contributor Author

@zhang-prog Are we waiting for the weights or will they be released after merge? Re slow integration tests

@vasqu We will get back to you after we have an internal discussion.

@vasqu
Contributor

vasqu commented Jan 22, 2026

No worries, you can also write us on slack then 🤗

@zhang-prog
Contributor Author

zhang-prog commented Jan 27, 2026

@molbap hi Pablo, we have made three changes, and this should be the final update (a sketch of items 1 and 2 follows after the list):

  1. We overrode the _preprocess method, because we require self.resize(..., antialias=False) to approximate the behavior of cv2.resize in order to ensure model accuracy, but the antialias argument cannot be passed in the current implementation.
  2. The decoder_order_head has been changed from a single nn.Linear layer to a stack of nn.Linear layers, with the depth defined by config.decoder_layers.
  3. We added polygon_points to the post-processing stage, which are derived from the model’s output masks.

Ready for review, PTAL!
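
For readers following along, items 1 and 2 might look roughly like the sketch below (assumed names throughout; this is not the exact diff):

import torch
from torch import nn
import torch.nn.functional as F


# item 1: bilinear resize without antialiasing to approximate cv2.resize;
# F.interpolate exposes the antialias flag directly
def resize_no_antialias(pixel_values: torch.Tensor, size: tuple[int, int]) -> torch.Tensor:
    # pixel_values: (batch, channels, height, width)
    return F.interpolate(pixel_values, size=size, mode="bilinear", align_corners=False, antialias=False)


# item 2: a stack of linear layers for the order head, depth tied to the
# decoder depth (attribute names here are assumptions)
def build_decoder_order_head(hidden_size: int, decoder_layers: int) -> nn.ModuleList:
    return nn.ModuleList(nn.Linear(hidden_size, hidden_size) for _ in range(decoder_layers))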

@github-actions
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=43098&sha=cbbc37

Contributor

@molbap left a comment

Mostly naming-related comments, plus one important thing: prefixes to be fixed. Thank you!

"num_attention_heads": "encoder_attention_heads",
}

def __init__(
Contributor

@zhang-prog if you have time 🙏

_no_split_modules = [r"PPDocLayoutV3HybridEncoder", r"PPDocLayoutV3DecoderLayer"]

@torch.no_grad()
def _init_weights(self, module):
Contributor

bumping 👀

Contributor

@vasqu left a comment

A lot of smaller things; the one thing I'm unsure about is whether the model should tie weights - at least it looks like that to me.

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, pp_doclayout_v3

@zhang-prog requested review from molbap and vasqu January 29, 2026 09:44
@molbap
Contributor

molbap commented Jan 29, 2026

run-slow: pp_doclayout_v3

@github-actions
Contributor

This comment contains run-slow, running the specified jobs:

models: ["models/pp_doclayout_v3"]
quantizations: []

@github-actions
Contributor

CI Results

Workflow Run ⚙️

✅ No failing test specific to this PR 🎉 !

@vasqu
Contributor

vasqu commented Jan 29, 2026

Merging - CI is on fire these days, and this PR only touches the new model, which passes all tests. Thanks a lot @zhang-prog!

@vasqu merged commit d196ee9 into huggingface:main Jan 29, 2026
18 of 24 checks passed