Support having multiple sub-processors (of any kind) in the same processor #42667
base: main
Changes from all commits: 9c6357c, f3bd01c, c84b564, abd038d, 114a48b, 51fc687
@@ -47,8 +47,6 @@ class LasrProcessorKwargs(ProcessingKwargs, total=False):


class LasrProcessor(ProcessorMixin):
    tokenizer_class = "ParakeetTokenizerFast"
**Member (Author):** Cc @eustlb Same here, although is this supposed to be the Parakeet or LASR tokenizer by default?
    def __init__(self, feature_extractor, tokenizer):
        super().__init__(feature_extractor, tokenizer)
@@ -61,10 +61,6 @@ class Pix2StructProcessor(ProcessorMixin):
            An instance of ['T5Tokenizer`]. The tokenizer is a required input.
    """

    attributes = ["image_processor", "tokenizer"]
    image_processor_class = "Pix2StructImageProcessor"
    tokenizer_class = ("T5Tokenizer",)
Comment on lines -64 to -67:

**Member (Author):** Cc @itazap Just to make sure it's safe to remove: was this added back by mistake?

**Collaborator:** Probably! If tests pass without it, it's not needed indeed.
    def __init__(self, image_processor, tokenizer):
        tokenizer.return_token_type_ids = False
        super().__init__(image_processor, tokenizer)
@@ -129,6 +129,26 @@ def keys(self):
    "video_processor": "BaseVideoProcessor",
}


def _get_modality_for_attribute(attribute_name: str) -> str:
    """
    Get the canonical modality type for a given attribute name.

    For example:
    - "image_processor" -> "image_processor"
    - "encoder_image_processor" -> "image_processor"
    - "text_tokenizer" -> "tokenizer"
    - "my_feature_extractor" -> "feature_extractor"
    """
    for modality in MODALITY_TO_AUTOPROCESSOR_MAPPING.keys():
        if modality in attribute_name:
            return modality
    raise ValueError(
        f"Cannot determine modality for attribute '{attribute_name}'. "
        f"Attribute name must contain one of: {list(MODALITY_TO_AUTOPROCESSOR_MAPPING.keys())}"
    )
Comment on lines +133 to +149:

**Member:** IMO this is simplified a lot. Users do not always name attributes following the pattern, and they might also want to use their own processing classes. There is a lot of inheritance and patching in custom code AFAIK, which can't be simplified to ...

**Member (Author):** I guess `attributes` is a misnomer here really, and we should maybe change it to `subprocessors`, but it was named this before the refactor, so I didn't want to change it and break backward compatibility. It might be worth changing it though, as it's causing a lot of confusion.
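For reference, a minimal standalone sketch of the matching rule the helper relies on. The key list below is a trimmed stand-in for `MODALITY_TO_AUTOPROCESSOR_MAPPING`, not the real mapping:

```python
# Trimmed stand-in for the keys of MODALITY_TO_AUTOPROCESSOR_MAPPING (assumption, not the real dict).
MODALITY_KEYS = ["image_processor", "tokenizer", "feature_extractor", "video_processor"]


def get_modality(attribute_name: str) -> str:
    # Same substring rule as _get_modality_for_attribute: the attribute name must contain a modality.
    for modality in MODALITY_KEYS:
        if modality in attribute_name:
            return modality
    raise ValueError(f"Cannot determine modality for attribute '{attribute_name}'.")


# Prefixed or suffixed names resolve to their canonical modality.
assert get_modality("encoder_image_processor") == "image_processor"
assert get_modality("decoder_tokenizer") == "tokenizer"
assert get_modality("my_feature_extractor") == "feature_extractor"
```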
if sys.version_info >= (3, 11):
    Unpack = typing.Unpack
else:
@@ -663,8 +683,10 @@ def check_argument_for_proper_class(self, argument_name, argument):
        mismatch between expected and actual class, an error is raise. Otherwise, the proper retrieved class
        is returned.
        """
        if argument_name not in MODALITY_TO_BASE_CLASS_MAPPING and "tokenizer" in argument_name:
            argument_name = "tokenizer"
        # If the exact attribute name is not in the mapping, use its canonical modality
        # (e.g., "encoder_tokenizer" -> "tokenizer")
        if argument_name not in MODALITY_TO_BASE_CLASS_MAPPING:
            argument_name = _get_modality_for_attribute(argument_name)
        class_name = MODALITY_TO_BASE_CLASS_MAPPING.get(argument_name)
        if isinstance(class_name, tuple):
            proper_class = tuple(self.get_possibly_dynamic_module(n) for n in class_name if n is not None)
@@ -695,9 +717,13 @@ def to_dict(self) -> dict[str, Any]:
            # extra attributes to be kept
            attrs_to_save += ["auto_map"]

        # Remove tokenizers from output - they have their own vocab files and are saved separately.
        # All other sub-processors (image_processor, feature_extractor, etc.) are kept in processor_config.json.
        for attribute in self.__class__.get_attributes():
            if "tokenizer" in attribute and attribute in output:
                del output[attribute]
            if attribute in output:
                modality = _get_modality_for_attribute(attribute)
                if modality == "tokenizer":
                    del output[attribute]

        if "chat_template" in output:
            del output["chat_template"]
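To illustrate the intended effect on serialization, here is a hypothetical `to_dict()` result for a processor that holds an extra image processor alongside its tokenizer; none of the keys or values below come from the PR, they only show the rule that tokenizer entries are dropped while other sub-processor configs stay:

```python
# Hypothetical to_dict() output. Any "tokenizer"-modality entries are removed because tokenizers
# are saved separately with their own vocab files; the extra image processor config is kept so it
# can be rebuilt from processor_config.json at load time.
serialized = {
    "processor_class": "MyMultiProcessor",  # hypothetical processor class name
    "ocr_image_processor": {                # hypothetical extra sub-processor, serialized inline
        "image_processor_type": "Pix2StructImageProcessor",
        "patch_size": {"height": 16, "width": 16},
    },
}
```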
@@ -819,13 +845,15 @@ def save_pretrained(self, save_directory, push_to_hub: bool = False, **kwargs):
            if hasattr(attribute, "_set_processor_class"):
                attribute._set_processor_class(self.__class__.__name__)

            # Save the tokenizer in its own vocab file. The other attributes are saved as part of `processor_config.json`
            if attribute_name == "tokenizer":
                attribute.save_pretrained(save_directory)
            # if a model has multiple tokenizers, save the additional tokenizers in their own folders.
            # Note that the additional tokenizers must have "tokenizer" in their attribute name.
            elif "tokenizer" in attribute_name:
                attribute.save_pretrained(os.path.join(save_directory, attribute_name))
            modality = _get_modality_for_attribute(attribute_name)
            is_primary = attribute_name == modality
            if modality == "tokenizer":
                # Save the tokenizer in its own vocab file. The other attributes are saved as part of `processor_config.json`
                if is_primary:
                    attribute.save_pretrained(save_directory)
                else:
                    # if a model has multiple tokenizers, save the additional tokenizers in their own folders.
                    attribute.save_pretrained(os.path.join(save_directory, attribute_name))
            elif attribute._auto_class is not None:
                custom_object_save(attribute, save_directory, config=attribute)
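As a concrete (hypothetical) picture of the resulting layout, assuming a processor instance with a primary tokenizer and an extra `decoder_tokenizer` attribute:

```python
# `processor` is assumed to be an instance of a ProcessorMixin subclass with attributes
# "tokenizer" and "decoder_tokenizer" (names invented for this sketch).
processor.save_pretrained("my_processor")

# Expected on-disk layout (sketch):
# my_processor/
#   processor_config.json        <- processor metadata + non-tokenizer sub-processor configs
#   tokenizer_config.json        <- primary tokenizer, saved at the root
#   tokenizer.json / vocab files
#   decoder_tokenizer/           <- extra tokenizer, saved in its own subfolder
#     tokenizer_config.json
#     tokenizer.json / vocab files
```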
@@ -1393,9 +1421,10 @@ def from_pretrained(
        if token is not None:
            kwargs["token"] = token

        args = cls._get_arguments_from_pretrained(pretrained_model_name_or_path, **kwargs)
        processor_dict, kwargs = cls.get_processor_dict(pretrained_model_name_or_path, **kwargs)
        return cls.from_args_and_dict(args, processor_dict, **kwargs)
        # Get processor_dict first so we can use it to instantiate non-tokenizer sub-processors
        processor_dict, instantiation_kwargs = cls.get_processor_dict(pretrained_model_name_or_path, **kwargs)
        args = cls._get_arguments_from_pretrained(pretrained_model_name_or_path, processor_dict, **kwargs)
        return cls.from_args_and_dict(args, processor_dict, **instantiation_kwargs)

    @classmethod
    def get_attributes(cls):
@@ -1405,7 +1434,7 @@ def get_attributes(cls):
            # don't treat audio_tokenizer as an attribute
            if sub_processor_type == "audio_tokenizer":
                continue
            if sub_processor_type in MODALITY_TO_AUTOPROCESSOR_MAPPING or "tokenizer" in sub_processor_type:
            if any(modality in sub_processor_type for modality in MODALITY_TO_AUTOPROCESSOR_MAPPING.keys()):
                attributes.append(sub_processor_type)

        # Legacy processors may not override `__init__` and instead expose modality
@@ -1419,7 +1448,7 @@ def get_attributes(cls):
                inferred_attribute = attribute_name[: -len("_class")]
                if inferred_attribute == "audio_tokenizer":
                    continue
                if inferred_attribute in MODALITY_TO_AUTOPROCESSOR_MAPPING or "tokenizer" in inferred_attribute:
                if any(modality in inferred_attribute for modality in MODALITY_TO_AUTOPROCESSOR_MAPPING.keys()):
                    attributes.append(inferred_attribute)

        return attributes
@@ -1447,32 +1476,80 @@ def register_for_auto_class(cls, auto_class="AutoProcessor"):
        cls._auto_class = auto_class

    @classmethod
    def _get_arguments_from_pretrained(cls, pretrained_model_name_or_path, **kwargs):
    def _get_arguments_from_pretrained(cls, pretrained_model_name_or_path, processor_dict=None, **kwargs):
        """
        Identify and instantiate the subcomponents of Processor classes, such as image processors, tokenizers,
        and feature extractors. This method inspects the processor's `__init__` signature to identify parameters
        that correspond to known modality types (image_processor, tokenizer, feature_extractor, etc.) or contain
        "tokenizer" in their name. It then uses the appropriate Auto class (AutoImageProcessor, AutoTokenizer, etc.)
        from `MODALITY_TO_AUTOPROCESSOR_MAPPING` to load each subcomponent via `.from_pretrained()`. For tokenizer-like
        parameters not explicitly in the mapping, the method uses AutoTokenizer with a subfolder argument.
        modality names in their attribute name.

        For tokenizers: Uses the appropriate Auto class (AutoTokenizer) to load via `.from_pretrained()`.
        Additional tokenizers (e.g., "decoder_tokenizer") are loaded from subfolders.

        For other sub-processors (image_processor, feature_extractor, etc.): Primary ones are loaded via
        Auto class. Additional ones are instantiated from the config stored in processor_config.json
        (passed as processor_dict).

        Args:
            pretrained_model_name_or_path: Path or model id to load from.
            processor_dict: Optional dict containing processor config (from processor_config.json).
                Required when loading additional non-tokenizer sub-processors.
        """
        args = []
        processor_dict = processor_dict if processor_dict is not None else {}
        # Remove subfolder from kwargs to avoid duplicate keyword arguments
        subfolder = kwargs.pop("subfolder", "")

        # get args from processor init signature
        sub_processors = cls.get_attributes()
        for sub_processor_type in sub_processors:
            if sub_processor_type in MODALITY_TO_AUTOPROCESSOR_MAPPING:
            modality = _get_modality_for_attribute(sub_processor_type)
            is_primary = sub_processor_type == modality

            if is_primary:
                # Primary non-tokenizer sub-processor: load via Auto class
                auto_processor_class = MODALITY_TO_AUTOPROCESSOR_MAPPING[sub_processor_type]
                sub_processor = auto_processor_class.from_pretrained(pretrained_model_name_or_path, **kwargs)
                sub_processor = auto_processor_class.from_pretrained(
                    pretrained_model_name_or_path, subfolder=subfolder, **kwargs
                )
                args.append(sub_processor)
            elif "tokenizer" in sub_processor_type:
                # Special case: tokenizer-like parameters not in the mapping (e.g., "protein_tokenizer")
                # Load using AutoTokenizer with subfolder
                auto_processor_class = MODALITY_TO_AUTOPROCESSOR_MAPPING["tokenizer"]
                tokenizer_subfolder = os.path.join(subfolder, sub_processor_type) if subfolder else sub_processor_type
                sub_processor = auto_processor_class.from_pretrained(
                    pretrained_model_name_or_path, subfolder=sub_processor_type, **kwargs
                    pretrained_model_name_or_path, subfolder=tokenizer_subfolder, **kwargs
                )
                args.append(sub_processor)

            elif sub_processor_type in processor_dict:
                # Additional non-tokenizer sub-processor: instantiate from config in processor_dict
                sub_processor_config = processor_dict[sub_processor_type]
                if isinstance(sub_processor_config, dict):
                    # Determine the class to instantiate
                    # Image processors have 'image_processor_type', feature extractors have 'feature_extractor_type'
                    type_key = f"{modality}_type"
                    class_name = sub_processor_config.get(type_key)
                    if class_name is None:
                        raise ValueError(
                            f"Cannot instantiate {sub_processor_type}: missing '{type_key}' in config. "
                            f"Config keys: {list(sub_processor_config.keys())}"
                        )
                    processor_class = cls.get_possibly_dynamic_module(class_name)
                    sub_processor = processor_class(**sub_processor_config)
                    args.append(sub_processor)
                else:
                    raise ValueError(
                        f"Expected dict for {sub_processor_type} in processor_config.json, "
                        f"got {type(sub_processor_config)}"
                    )
            else:
                raise ValueError(
                    f"Cannot find config for {sub_processor_type} in processor_config.json. "
                    f"Available keys: {list(processor_dict.keys())}"
                )
Comment on lines +1547 to +1551:

**Member:** Also not sure it's a good idea to raise an error if the attribute has no config dict. One possible use case is when a processor has optional attributes that are left out on purpose (see #40447).

**Member (Author):** Not sure I understood the use case in the linked issue :( Do you have an example? This code path would only be used for entries in processor_dict that correspond to a sub-processor.
        return args

    @staticmethod
Cc @eustlb, moved this to the auto files to have one source of truth, and removed `attributes` as they are now auto-detected.
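To tie this together, a hedged sketch of the kind of processor the change is meant to support. The class and attribute names here (`MyDualProcessor`, `scene_image_processor`, `decoder_tokenizer`) are invented for illustration, and a concrete processor may still need the usual `*_class` declarations:

```python
from transformers.processing_utils import ProcessorMixin


class MyDualProcessor(ProcessorMixin):
    # No explicit `attributes` list: with this change, sub-processors are auto-detected from the
    # __init__ signature because every parameter name contains a known modality
    # ("image_processor" or "tokenizer").
    def __init__(self, image_processor, scene_image_processor, tokenizer, decoder_tokenizer):
        super().__init__(image_processor, scene_image_processor, tokenizer, decoder_tokenizer)


# On save_pretrained: both tokenizers get their own vocab files (the second in "decoder_tokenizer/"),
# while "scene_image_processor" is serialized into processor_config.json.
# On from_pretrained: the primary image_processor and tokenizer load via their Auto classes, the
# extra tokenizer loads from its subfolder, and scene_image_processor is rebuilt from its config dict.
```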