
Auto processor #14465

Merged: 7 commits into master from auto_processor, Nov 22, 2021
Conversation

@sgugger (Collaborator) commented Nov 19, 2021

What does this PR do?

This PR adds an AutoProcessor API, similar to AutoTokenizer and AutoFeatureExtractor.
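A quick usage sketch of the resulting API (the checkpoint name is illustrative, not from the PR):

from transformers import AutoProcessor

# Returns the preprocessor declared by the checkpoint, e.g. a Wav2Vec2Processor
# for a Wav2Vec2 speech checkpoint.
processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")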

# First, look for a processor_class in the preprocessor_config
# (imports below are assumptions added to make the excerpt self-contained)
from transformers.feature_extraction_utils import FeatureExtractionMixin
from transformers.models.auto.processing_auto import processor_class_from_name

config_dict, _ = FeatureExtractionMixin.get_feature_extractor_dict(pretrained_model_name_or_path, **kwargs)
if "processor_class" in config_dict:
    processor_class = processor_class_from_name(config_dict["processor_class"])
@sgugger (Collaborator, Author) commented:

Note that here I chose "processor_class". Feature extractors use feature_extractor_type but tokenizers use tokenizer_class, which I think is more appropriate, as the value is a class name. A class name is a type if you want to go there, but by model_type we imply something like "bert" or "speech_to_text", not a class name.
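To make the key names concrete, here is a hypothetical config_dict as loaded from a checkpoint's preprocessor_config.json (the values are illustrative):

# Hypothetical preprocessor_config contents; "processor_class" holds a class
# name, mirroring the tokenizer_class convention rather than model_type.
config_dict = {
    "feature_extractor_type": "Wav2Vec2FeatureExtractor",
    "processor_class": "Wav2Vec2Processor",
}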

@LysandreJik (Member) left a comment:

Thanks for working on it! What I had in mind when we were discussing the AutoProcessor was a bit different: an object that would return the correct preprocessor for each checkpoint. That would be a BertTokenizerFast for BERT, and a Wav2Vec2Processor for Wav2Vec2.

This way the code necessary to instantiate a model and its preprocessor could be identical, independent of the checkpoint.
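A hedged sketch of that idea (checkpoint names are illustrative, not from the comment):

from transformers import AutoModel, AutoProcessor

# The same two lines work for any checkpoint, whatever the modality:
# a BertTokenizerFast comes back for BERT, a Wav2Vec2Processor for Wav2Vec2.
for checkpoint in ("bert-base-uncased", "facebook/wav2vec2-base-960h"):
    model = AutoModel.from_pretrained(checkpoint)
    preprocessor = AutoProcessor.from_pretrained(checkpoint)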

The AutoProcessor as you have designed it is necessary anyway, so this still looks good to me as-is.

Comment on lines -84 to +86

-    The tokenizer class to instantiate is selected based on the :obj:`model_type` property of the config object
-    (either passed as an argument or loaded from :obj:`pretrained_model_name_or_path` if possible), or when it's
-    missing, by falling back to using pattern matching on :obj:`pretrained_model_name_or_path`:
+    The feature extractor class to instantiate is selected based on the :obj:`model_type` property of the config
+    object (either passed as an argument or loaded from :obj:`pretrained_model_name_or_path` if possible), or when
+    it's missing, by falling back to using pattern matching on :obj:`pretrained_model_name_or_path`:
Member:

Nice catch

src/transformers/models/auto/processing_auto.py (outdated review thread, resolved)
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
@sgugger (Collaborator, Author) commented Nov 22, 2021

I can amend the PR to have the auto-processor then go to a tokenizer (if available) or a feature extractor (if available), as I think that's the logic we want anyway.
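A minimal sketch of that fallback order (illustrative, not the PR's exact code; load_auto_processor and find_processor_class are hypothetical names):

from transformers import AutoFeatureExtractor, AutoTokenizer

def find_processor_class(name, **kwargs):
    # Hypothetical helper: would look up "processor_class" in the checkpoint's
    # preprocessor_config, as in the snippet from the PR description.
    return None

def load_auto_processor(name, **kwargs):
    # 1. Use a dedicated ...Processor class if the checkpoint declares one.
    processor_class = find_processor_class(name, **kwargs)
    if processor_class is not None:
        return processor_class.from_pretrained(name, **kwargs)
    # 2. Otherwise fall back to a tokenizer, if the checkpoint has one...
    try:
        return AutoTokenizer.from_pretrained(name, **kwargs)
    except (OSError, ValueError):
        pass
    # 3. ...and finally to a feature extractor.
    return AutoFeatureExtractor.from_pretrained(name, **kwargs)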

@LysandreJik (Member) left a comment:

LGTM, thanks for the integration tests!

sgugger merged commit 204d251 into master on Nov 22, 2021
sgugger deleted the auto_processor branch on November 22, 2021 17:17
@LysandreJik (Member) commented:

Discussed it a bit with @patrickvonplaten, who isn't as excited as I am about the AutoProcessor being an umbrella over all modalities' preprocessors; he raises important API questions.

Will let him comment so that we all align on the choices :)

@patrickvonplaten (Contributor) commented:

I see the need for an AutoProcessor class, but I'm not a fan of making it an umbrella class for tokenizers, feature extractors, and processors alike, because:

i) It goes a bit against our "no-magic" & easy-to-understand code philosophy IMO. Having AutoProcessor wrap both AutoTokenizer and AutoFeatureExtractor makes this code quite difficult to understand. E.g., if for some reason this class fails to load an NLP tokenizer, the traceback can become quite complex (AutoProcessor -> AutoTokenizer -> multiple ways of loading the AutoTokenizer via tokenizer_config, tokenizer_type, or the model config). Also, I'm quite sure this function will become much more complex over time to handle all kinds of weird use cases. We could limit this complexity by not making it return AutoFeatureExtractor or AutoTokenizer.

ii) IMO it breaks a design pattern. So far we had the following design pattern IMO:

  • AutoTokenizer returns a tokenizer of type PreTrainedTokenizer or PreTrainedTokenizerFast.
  • AutoFeatureExtractor returns a feature extractor of type FeatureExtractionMixin.
    -> Both of those classes have, IMO, more or less the same design.
    It is much more intuitive IMO that AutoProcessor only returns ...Processor objects and nothing more. Also, I understand a ...Processor not really as a general "whatever-you-can-process" class, but as a wrapper object that always includes two or more pre- or post-processing objects (e.g. a speech input pre-processor and a text output post-processor), as illustrated below. Admittedly, the naming is not great here, though, as ...Processor does encompass pretty much all kinds of tokenization, feature extraction, etc.
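A hedged illustration of that wrapper notion (the classes are real transformers classes; the checkpoint name is illustrative):

from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor, Wav2Vec2Processor

# A ...Processor in this sense bundles two or more pre-/post-processing objects:
feature_extractor = Wav2Vec2FeatureExtractor()  # speech input pre-processor
tokenizer = Wav2Vec2CTCTokenizer.from_pretrained("facebook/wav2vec2-base-960h")  # text output post-processor
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)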

iii) I don't really see the use case for this class. IMO there is no need to force an Auto... class to be useful for more than one task (or modality). E.g. I don't think many users are keen to have a single script in which they can quickly switch between a text tokenizer and a speech recognition processor => for me the beauty of Auto... is being able to quickly try out multiple different checkpoints for the same task. To do so, it's enough to pick one Auto... model class, such as AutoModelForCausalLM, together with e.g. AutoTokenizer. I don't see at all the need to be able to quickly switch between different tasks in the same script. If one wants to switch from a language-generation task to, let's say, speech classification, I don't think the convenience of not having to change AutoTokenizer to AutoFeatureExtractor is worth much compared to the complexity added to this function.

iv) The idea here is really that AutoProcessor can be used for all kinds of preprocessing. This might make the user believe that the same holds true for AutoModel. But AutoModel is different IMO, as it only returns the encoder of the models and cannot really include all models (e.g. RAG, EncoderDecoder, SpeechEncoderDecoder, ...).

To conclude, I would prefer to have AutoProcessor just return ...Processor objects and neither feature extractors nor tokenizers.

There is one thing where I see my logic as a bit flawed and where I understand why this class is coded the way it is:

a) The "...Processing" name. I agree that all pre- and post-processing (tokenization, feature extraction, etc.) can be summarized by the name "processing".

Very interested in discussing this a bit more!

@julien-c (Member) commented Dec 1, 2021

I know nothing about the details of this, but from my superficial understanding I agree that "I'm not a fan of making it an umbrella class for both tokenizers, feature extractors and processors".

@julien-c (Member) commented Dec 1, 2021

(Note that thanks to @sgugger's automated metadata sharing, we will soon be able to display actually sensible sample code for tokenization/preprocessing/etc. on the transformers models on the Hub.)

@sgugger (Collaborator, Author) commented Dec 1, 2021

I have absolutely no strong opinion on this; I added it because @LysandreJik told me to :-)

LysandreJik restored the auto_processor branch on December 4, 2021 09:55
Albertobegue pushed a commit to Albertobegue/transformers that referenced this pull request Jan 27, 2022
* Add AutoProcessor class

* Init and tests

* Add doc

* Fix init

* Update src/transformers/models/auto/processing_auto.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Reverts to tokenizer or feature extractor when available

* Adapt test

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
LysandreJik deleted the auto_processor branch on May 3, 2022 14:25
@patrickvonplaten (Contributor) commented:

Following the discussion in https://github.com/huggingface/moon-landing/issues/3632,

I want to kick-start a discussion here, as this PR/issue has been hanging in the air a bit. I think it was mostly me blocking this PR, and it might be time to unblock it.

Having revisited my points here, I guess my opinion has changed a bit with regard to:

i) The no-magic philosophy applies a bit less to Auto... classes, I guess, since they can now also directly load from the Hub, cover all kinds of models, etc., so I would not count this as a strong argument anymore. I do think we'll quickly get quite complex code in AutoProcessor to handle all the weird use cases, though.

ii) I still feel strongly about this, as it clearly breaks a pattern and is somewhat unexpected to someone who knows transformers well IMO: AutoProcessor returning a ...Tokenizer even though there is an AutoTokenizer is not clean IMO.

iii) I do see a clearer use case now! So I'm happy to scratch that one.

iv) I think it's also not that big of a deal; models and processors can just be treated differently.

=> So overall, if @LysandreJik, @thomwolf, and @sgugger are more in favor of merging this PR, I'm happy to be outvoted here :-) I don't feel that strongly about it anymore.

@sgugger (Collaborator, Author) commented Aug 31, 2022

If we pick a different name for the auto class (not AutoProcessor), I think it makes ii) a moot point. Since it seems to be your biggest argument against, would that be enough of a compromise for you?

@patrickvonplaten (Contributor) commented:

Yes, I'm also totally fine going with AutoProcessor. I do see the ease of use as a big argument that outweighs ii), so I'm OK with AutoProcessor if that's the best name here.

@LysandreJik (Member) commented:

Glad to see we're aligned on good coverage of processors! I personally like AutoProcessor and don't think it's necessarily unclear if we have some good documentation.

AutoProcessor is also the current UI for some models, so the API won't be changed for these models (which is good).
