
Auto processor #14465

Merged: 7 commits into master from auto_processor, Nov 22, 2021
Conversation

@sgugger (Collaborator) commented Nov 19, 2021

What does this PR do?

This PR adds an AutoProcessor API, similar to AutoTokenizer and AutoFeatureExtractor.
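A quick usage sketch of the resulting API (the checkpoint name is illustrative, not from the PR):

from transformers import AutoProcessor

# Returns the preprocessor declared by the checkpoint, e.g. a Wav2Vec2Processor
# for a Wav2Vec2 speech checkpoint.
processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")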

# First, look for a processor_class in the preprocessor_config
# (imports below are assumptions added to make the excerpt self-contained)
from transformers.feature_extraction_utils import FeatureExtractionMixin
from transformers.models.auto.processing_auto import processor_class_from_name

config_dict, _ = FeatureExtractionMixin.get_feature_extractor_dict(pretrained_model_name_or_path, **kwargs)
if "processor_class" in config_dict:
    processor_class = processor_class_from_name(config_dict["processor_class"])
@sgugger (Collaborator, Author) commented:

Note that here I chose "processor_class". Feature extractors use feature_extractor_type but tokenizers use tokenizer_class, which I think is more appropriate, as the value is a class name. A class name is a type if you want to go there, but by model_type we imply something like "bert" or "speech_to_text", not a class name.
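To make the key names concrete, here is a hypothetical config_dict as loaded from a checkpoint's preprocessor_config.json (the values are illustrative):

# Hypothetical preprocessor_config contents; "processor_class" holds a class
# name, mirroring the tokenizer_class convention rather than model_type.
config_dict = {
    "feature_extractor_type": "Wav2Vec2FeatureExtractor",
    "processor_class": "Wav2Vec2Processor",
}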

@LysandreJik (Member) left a comment:

Thanks for working on it! What I had in mind when we were discussing the AutoProcessor was a bit different: an object that would return the correct preprocessor for each checkpoint. That would be a BertTokenizerFast for BERT, and a Wav2Vec2Processor for Wav2Vec2.

This way the code necessary to instantiate a model and its preprocessor could be identical, independent of the checkpoint.
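A hedged sketch of that idea (checkpoint names are illustrative, not from the comment):

from transformers import AutoModel, AutoProcessor

# The same two lines work for any checkpoint, whatever the modality:
# a BertTokenizerFast comes back for BERT, a Wav2Vec2Processor for Wav2Vec2.
for checkpoint in ("bert-base-uncased", "facebook/wav2vec2-base-960h"):
    model = AutoModel.from_pretrained(checkpoint)
    preprocessor = AutoProcessor.from_pretrained(checkpoint)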

The AutoProcessor as you have designed it is necessary anyway, so this still looks good to me as-is.

Comment on lines -84 to +86

-    The tokenizer class to instantiate is selected based on the :obj:`model_type` property of the config object
-    (either passed as an argument or loaded from :obj:`pretrained_model_name_or_path` if possible), or when it's
-    missing, by falling back to using pattern matching on :obj:`pretrained_model_name_or_path`:
+    The feature extractor class to instantiate is selected based on the :obj:`model_type` property of the config
+    object (either passed as an argument or loaded from :obj:`pretrained_model_name_or_path` if possible), or when
+    it's missing, by falling back to using pattern matching on :obj:`pretrained_model_name_or_path`:
Member:

Nice catch

src/transformers/models/auto/processing_auto.py (outdated review thread, resolved)
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
@sgugger (Collaborator, Author) commented Nov 22, 2021

I can amend the PR to have the auto-processor then go to a tokenizer (if available) or a feature extractor (if available), as I think that's the logic we want anyway.
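A minimal sketch of that fallback order (illustrative, not the PR's exact code; load_auto_processor and find_processor_class are hypothetical names):

from transformers import AutoFeatureExtractor, AutoTokenizer

def find_processor_class(name, **kwargs):
    # Hypothetical helper: would look up "processor_class" in the checkpoint's
    # preprocessor_config, as in the snippet from the PR description.
    return None

def load_auto_processor(name, **kwargs):
    # 1. Use a dedicated ...Processor class if the checkpoint declares one.
    processor_class = find_processor_class(name, **kwargs)
    if processor_class is not None:
        return processor_class.from_pretrained(name, **kwargs)
    # 2. Otherwise fall back to a tokenizer, if the checkpoint has one...
    try:
        return AutoTokenizer.from_pretrained(name, **kwargs)
    except (OSError, ValueError):
        pass
    # 3. ...and finally to a feature extractor.
    return AutoFeatureExtractor.from_pretrained(name, **kwargs)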

@LysandreJik (Member) left a comment:

LGTM, thanks for the integration tests!

sgugger merged commit 204d251 into master on Nov 22, 2021
sgugger deleted the auto_processor branch on November 22, 2021 17:17
@LysandreJik (Member) commented:

Discussed it a bit with @patrickvonplaten, who isn't as excited as I am about the AutoProcessor being an umbrella over all modalities' preprocessors; he raises important API questions.

Will let him comment so that we all align on the choices :)

@patrickvonplaten (Contributor) commented:

I see the need for an AutoProcessor class, but I'm not a fan of making it an umbrella class for tokenizers, feature extractors, and processors alike, because:

i) It goes a bit against our "no-magic" & easy-to-understand code philosophy IMO. Having AutoProcessor wrap both AutoTokenizer and AutoFeatureExtractor makes this code quite difficult to understand. E.g., if for some reason this class fails to load an NLP tokenizer, the traceback can become quite complex (AutoProcessor -> AutoTokenizer -> multiple ways of loading the AutoTokenizer via tokenizer_config, tokenizer_type, or the model config). Also, I'm quite sure this function will become much more complex over time to handle all kinds of weird use cases. We could limit this complexity by not making it return AutoFeatureExtractor or AutoTokenizer.

ii) IMO it breaks a design pattern. So far we had the following design pattern IMO:

  • AutoTokenizer returns a tokenizer of type PreTrainedTokenizer or PreTrainedTokenizerFast.
  • AutoFeatureExtractor returns a feature extractor of type FeatureExtractionMixin.
    -> Both of those classes have, IMO, more or less the same design.
    It is much more intuitive IMO that AutoProcessor only returns ...Processor objects and nothing more. Also, I understand a ...Processor not really as a general "whatever-you-can-process" class, but as a wrapper object that always includes two or more pre- or post-processing objects (e.g. a speech input pre-processor and a text output post-processor), as illustrated below. Admittedly, the naming is not great here, though, as ...Processor does encompass pretty much all kinds of tokenization, feature extraction, etc.
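A hedged illustration of that wrapper notion (the classes are real transformers classes; the checkpoint name is illustrative):

from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor, Wav2Vec2Processor

# A ...Processor in this sense bundles two or more pre-/post-processing objects:
feature_extractor = Wav2Vec2FeatureExtractor()  # speech input pre-processor
tokenizer = Wav2Vec2CTCTokenizer.from_pretrained("facebook/wav2vec2-base-960h")  # text output post-processor
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)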

iii) I don't really see the use case for this class. IMO there is no need to force an Auto... class to be useful for more than one task (or modality). E.g. I don't think many users are keen to have a single script in which they can quickly switch between a text tokenizer and a speech recognition processor => for me the beauty of Auto... is being able to quickly try out multiple different checkpoints for the same task. To do so, it's enough to pick one Auto... model class, such as AutoModelForCausalLM, together with e.g. AutoTokenizer. I don't see at all the need to be able to quickly switch between different tasks in the same script. If one wants to switch from a language-generation task to, let's say, speech classification, I don't think the convenience of not having to change AutoTokenizer to AutoFeatureExtractor is worth much compared to the complexity added to this function.

iv) The idea here is really that AutoProcessor can be used for all kinds of preprocessing. This might make the user believe that the same holds true for AutoModel. But AutoModel is different IMO, as it only returns the encoder of the models and cannot really include all models (e.g. RAG, EncoderDecoder, SpeechEncoderDecoder, ...).

To conclude, I would prefer to have AutoProcessor just return ...Processor objects and neither feature extractors nor tokenizers.

There is one thing where I see my logic as a bit flawed and where I understand why this class is coded the way it is:

a) The "...Processing" name. I agree that all pre- and post-processing (tokenization, feature extraction, etc.) can be summarized by the name "processing".

Very interested in discussing this a bit more!

@julien-c (Member) commented Dec 1, 2021

I know nothing about the details of this, but from my superficial understanding I agree that "I'm not a fan of making it an umbrella class for both tokenizers, feature extractors and processors".

@julien-c (Member) commented Dec 1, 2021

(Note that thanks to @sgugger's automated metadata sharing, we will soon be able to display actually sensible sample code for tokenization/preprocessing/etc. on the transformers models on the Hub.)

@sgugger (Collaborator, Author) commented Dec 1, 2021

I have absolutely no strong opinion on this; I added it because @LysandreJik told me to :-)

LysandreJik restored the auto_processor branch on December 4, 2021 09:55
Albertobegue pushed a commit to Albertobegue/transformers that referenced this pull request Jan 27, 2022
* Add AutoProcessor class

* Init and tests

* Add doc

* Fix init

* Update src/transformers/models/auto/processing_auto.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Reverts to tokenizer or feature extractor when available

* Adapt test

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
LysandreJik deleted the auto_processor branch on May 3, 2022 14:25
@patrickvonplaten (Contributor) commented:

Following the discussion in https://github.com/huggingface/moon-landing/issues/3632,

I want to kick-start a discussion here, as this PR/issue has been hanging in the air a bit. I think it was mostly me blocking this PR, and it might be time to unblock it.

Having revisited my points here, I guess my opinion has changed a bit with regard to:

i) The no-magic philosophy applies a bit less to Auto... classes, I guess, since they can now also directly load from the Hub, cover all kinds of models, etc., so I would not count this as a strong argument anymore. I do think we'll quickly get quite complex code in AutoProcessor to handle all the weird use cases, though.

ii) I still feel strongly about this, as it clearly breaks a pattern and is somewhat unexpected to someone who knows transformers well IMO: AutoProcessor returning a ...Tokenizer even though there is an AutoTokenizer is not clean IMO.

iii) I do see a clearer use case now! So I'm happy to scratch that one.

iv) I think it's also not that big of a deal; models and processors can just be treated differently.

=> So overall, if @LysandreJik, @thomwolf, and @sgugger are more in favor of merging this PR, I'm happy to be outvoted here :-) I don't feel that strongly about it anymore.

@sgugger (Collaborator, Author) commented Aug 31, 2022

If we pick a different name for the auto class (not AutoProcessor), I think it makes ii) a moot point. Since it seems to be your biggest argument against, would that be enough of a compromise for you?

@patrickvonplaten (Contributor) commented:

Yes, I'm also totally fine going with AutoProcessor. I do see the ease of use as a big argument that outweighs ii), so I'm OK with AutoProcessor if that's the best name here.

@LysandreJik (Member) commented:

Glad to see we're aligned on good coverage of processors! I personally like AutoProcessor and don't think it's necessarily unclear if we have some good documentation.

AutoProcessor is also the current UI for some models, so the API won't be changed for these models (which is good).
