
Processing Utils: honor pre-built sub-processor kwargs in from_pretrained #45627

Open
javierdejesusda wants to merge 1 commit into huggingface:main from javierdejesusda:fix/autoprocessor-honor-tokenizer-kwarg-44987

Conversation

@javierdejesusda
Contributor

What does this PR do?

Addresses @ArthurZucker's follow-up on #44987. When a caller passes a pre-built sub-processor to AutoProcessor.from_pretrained — e.g. tokenizer=tok or bpe_tokenizer=tok — the instance is now used directly instead of being silently forwarded into the sub-loader calls.

Exact attribute names always take precedence (bpe_tokenizer=). For processors with a single sub-processor of a given modality, the canonical modality name (tokenizer=) is also accepted as an alias — this matches the reproducer in the issue comment, where the remote UniversalActionProcessor takes bpe_tokenizer but the caller passes tokenizer=.
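
For illustration, a minimal sketch of the two call shapes this is meant to honor; `tok` is assumed to be an already-built tokenizer instance obtained elsewhere, and the repo id is the one from the linked issue:

from transformers import AutoProcessor

# Exact attribute name always wins (the remote UniversalActionProcessor stores it as `bpe_tokenizer`).
p = AutoProcessor.from_pretrained("physical-intelligence/fast", bpe_tokenizer=tok, trust_remote_code=True)

# Canonical modality name, accepted because the processor has a single tokenizer-modality attribute.
p = AutoProcessor.from_pretrained("physical-intelligence/fast", tokenizer=tok, trust_remote_code=True)

assert p.bpe_tokenizer is tok  # the pre-built instance is used directly, not reloaded from disk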

The OP's zero-kwarg reproducer (AutoProcessor.from_pretrained("physical-intelligence/fast", trust_remote_code=True)) is a separate hub repo layout issue and is not addressed here.

Before submitting

Who can review?

cc @ArthurZucker @itazap

Processing Utils: honor pre-built sub-processor kwargs in from_pretrained

When a caller passes a pre-built sub-processor via kwargs to
`AutoProcessor.from_pretrained` (e.g. `tokenizer=tok` or `bpe_tokenizer=tok`),
use the instance directly instead of silently forwarding it into the
sub-loader calls. Exact attribute names take precedence; the canonical
modality name is also accepted as an alias when a single sub-processor has
that modality.
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto

Comment on lines +1438 to +1452
def _pop_prebuilt_subprocessors(cls, kwargs: dict) -> dict:
"""Pop pre-built sub-processors from `kwargs` by exact attribute name, or by modality
alias (e.g. `tokenizer=` → `bpe_tokenizer`) when that modality is unambiguous.
"""
sub_processors = cls.get_attributes()
modality_counts = Counter(_get_modality_for_attribute(s) for s in sub_processors)
prebuilt = {}
for sub_processor_type in sub_processors:
modality = _get_modality_for_attribute(sub_processor_type)
instance = kwargs.pop(sub_processor_type, None)
if instance is None and modality != sub_processor_type and modality_counts[modality] == 1:
instance = kwargs.pop(modality, None)
if instance is not None:
prebuilt[sub_processor_type] = instance
return prebuilt
Member

@zucchini-nlp Apr 24, 2026


Not really sure about this. The error from the GH issue is due to old remote code, and we don't yet support Pi0-FAST natively in transformers. Also cc @yonigozlan, I guess you might have seen a similar issue when refactoring processor loading.

We're planning native support though, and waiting for the lerobot team to test and convert the configs correctly.

@javierdejesusda
Contributor Author


Thanks for taking a look, @zucchini-nlp!

Quick scope note: this PR isn't targeting the OP's zero-kwarg traceback (that's the hub layout / old remote code path you mentioned, which I agree is out of scope here and will be obsoleted by native support). It's targeting @ArthurZucker's follow-up comment on the issue:

p = AutoProcessor.from_pretrained("physical-intelligence/fast", tokenizer=tokenizer, trust_remote_code=True, use_fast=False)

"this does not work and it should!"

The underlying behavior is general to ProcessorMixin: when a caller supplies a pre-built sub-processor via kwargs (whether tokenizer= or the exact attribute name like bpe_tokenizer=), the instance is silently dropped and the loader tries to reload from disk anyway. Any processor with a non-primary tokenizer attribute runs into this, so native Pi0-FAST support wouldn't fix it on its own; it would just mean one fewer processor hitting it.
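
To make the "non-primary tokenizer attribute" case concrete, here's a minimal sketch; the class name, the commented-out repo id, and `tok` are placeholders, and it uses the classic `attributes` class-variable convention:

from transformers import ProcessorMixin

class MyActionProcessor(ProcessorMixin):
    # The only tokenizer-modality attribute is not named `tokenizer`.
    attributes = ["bpe_tokenizer"]
    bpe_tokenizer_class = "AutoTokenizer"

    def __init__(self, bpe_tokenizer):
        super().__init__(bpe_tokenizer=bpe_tokenizer)

# With this change, both calls would keep the pre-built instance instead of reloading it:
#   MyActionProcessor.from_pretrained("org/repo", bpe_tokenizer=tok)  # exact attribute name
#   MyActionProcessor.from_pretrained("org/repo", tokenizer=tok)      # modality alias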

That said, happy to defer fully. If you and @yonigozlan / @ArthurZucker feel this should wait (or be folded into the native support work, or handled differently), I'm glad to close or rescope; just let me know.

@zucchini-nlp
Member


Yeah, totally get it. Personally, I think we can deliberately not support it, since it's remote code and not v5-compatible, unless Arthur/Yoni have a different opinion.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

