Auto processor #14465
Conversation
```python
# First, look for a processor_type in the preprocessor_config
config_dict, _ = FeatureExtractionMixin.get_feature_extractor_dict(pretrained_model_name_or_path, **kwargs)
if "processor_class" in config_dict:
    processor_class = processor_class_from_name(config_dict["processor_class"])
```
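A hedged sketch of the lookup in the diff above; the registry, the placeholder class, and `resolve_processor_class` are simplified stand-ins for the real transformers machinery, not the actual implementation:

```python
# Toy sketch (hypothetical names, not the transformers source): resolve a
# processor class from a "processor_class" key in a preprocessor config.

class Wav2Vec2Processor:
    """Placeholder for the real processor class."""

# Hypothetical registry mapping class names to classes.
PROCESSOR_REGISTRY = {"Wav2Vec2Processor": Wav2Vec2Processor}

def processor_class_from_name(class_name):
    # Return the registered class, or None if the name is unknown.
    return PROCESSOR_REGISTRY.get(class_name)

def resolve_processor_class(config_dict):
    # Mirror the diff: prefer an explicit "processor_class" entry.
    if "processor_class" in config_dict:
        return processor_class_from_name(config_dict["processor_class"])
    return None
```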
Note that here I chose `"processor_class"`. Feature extractors use `feature_extractor_type` but tokenizers use `tokenizer_class`, which I think is better suited, as the value is a class name. A class is a type if you want to go there, but by `model_type` we imply something like "bert" or "speech_to_text", not a class name.
Thanks for working on it! What I had in mind when we were discussing the `AutoProcessor` was a bit different: an object that would return the correct preprocessor for each checkpoint. That would be a `BertTokenizerFast` for BERT, and a `Wav2Vec2Processor` for Wav2Vec2.

This way the code necessary to instantiate a model and its preprocessor could be identical, independent of the checkpoint.

The `AutoProcessor` as you have designed it is necessary anyway, so this still looks good to me as-is.
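The behaviour described above could look roughly like this; the class names are dummies standing in for the real transformers classes, and the mapping is a hypothetical simplification of the auto-class machinery:

```python
# Sketch of a checkpoint-agnostic entry point: the same call returns a
# tokenizer for BERT and a processor for Wav2Vec2 (dummy classes here).

class BertTokenizerFast:
    """Placeholder for the real fast tokenizer."""

class Wav2Vec2Processor:
    """Placeholder for the real processor."""

# Hypothetical mapping from a config's model_type to its preprocessor.
PREPROCESSOR_FOR_MODEL_TYPE = {
    "bert": BertTokenizerFast,
    "wav2vec2": Wav2Vec2Processor,
}

def auto_preprocessor(model_type):
    # One call site works for any checkpoint; the mapping picks the class.
    return PREPROCESSOR_FOR_MODEL_TYPE[model_type]()
```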
```
The tokenizer class to instantiate is selected based on the :obj:`model_type` property of the config object
(either passed as an argument or loaded from :obj:`pretrained_model_name_or_path` if possible), or when it's
missing, by falling back to using pattern matching on :obj:`pretrained_model_name_or_path`:
```

```
The feature extractor class to instantiate is selected based on the :obj:`model_type` property of the config
object (either passed as an argument or loaded from :obj:`pretrained_model_name_or_path` if possible), or when
it's missing, by falling back to using pattern matching on :obj:`pretrained_model_name_or_path`:
```
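The pattern-matching fallback mentioned in both docstrings could be sketched like this; the pattern table below is an illustrative assumption, not the real mapping used by the library:

```python
# Sketch of the fallback: when the config carries no model_type, infer it
# by substring matching on the checkpoint name or path.

# More specific patterns come first so they are not shadowed by shorter ones.
PATTERN_TO_MODEL_TYPE = [
    ("wav2vec2", "wav2vec2"),
    ("speech_to_text", "speech_to_text"),
    ("bert", "bert"),
]

def infer_model_type(pretrained_model_name_or_path):
    name = pretrained_model_name_or_path.lower()
    for pattern, model_type in PATTERN_TO_MODEL_TYPE:
        if pattern in name:
            return model_type
    return None
```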
Nice catch
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
I can amend the PR to have the auto-processor then go to a tokenizer (if available) or a feature extractor (if available), as I think that's the logic we want anyway.
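The fallback order described here (processor if present, else tokenizer, else feature extractor) can be sketched as a try-in-order loop; the loader callables below are hypothetical stand-ins for the real `from_pretrained` methods:

```python
# Sketch: attempt each loader in priority order, return the first success.

def load_first_available(name, loaders):
    # loaders: ordered (label, callable) pairs; a loader signals
    # "not available" for this checkpoint by raising an exception.
    errors = []
    for label, load in loaders:
        try:
            return label, load(name)
        except Exception as exc:
            errors.append(f"{label}: {exc}")
    raise ValueError(f"No preprocessor found for {name!r}: {errors}")

# Dummy loaders: this checkpoint has a tokenizer but no processor.
def load_processor(name):
    raise OSError("no processor config")

def load_tokenizer(name):
    return f"tokenizer for {name}"

def load_feature_extractor(name):
    return f"feature extractor for {name}"

kind, obj = load_first_available(
    "bert-base-cased",
    [("processor", load_processor),
     ("tokenizer", load_tokenizer),
     ("feature_extractor", load_feature_extractor)],
)
```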
LGTM, thanks for the integration tests!
Discussed it a bit with @patrickvonplaten, who isn't as excited as I am about it. Will let him comment so that we all align on the choices :)
I see the need for a

i) It goes a bit against our "no-magic" & easy-to-understand code IMO. Having

ii) IMO it breaks a design pattern. So far we had the following design pattern IMO:

iii) I don't see the use-case of this class really. IMO there is no need to force an

iv) The idea here is really that

To conclude, I would prefer to have

There is one thing where I see my logic a bit flawed and where I understand why this class is coded the way it is:

a) The "...Processing" name. I agree that all pre- and post- tokenization, feature extraction, etc. can be summarized by the name "processing".

Very interested in discussing this a bit more!
I know nothing about the details of this, but from my superficial understanding I agree that "I'm not a fan of making it an umbrella class for both tokenizers, feature extractors and processors".
(Note that thanks to @sgugger's automated metadata sharing, we will soon be able to display actually sensible sample code for tokenization/preprocessing/etc. on the transformers models in the Hub.)
I have absolutely no strong opinion on this; I added it because @LysandreJik told me to :-)
* Add AutoProcessor class
* Init and tests
* Add doc
* Fix init
* Update src/transformers/models/auto/processing_auto.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Reverts to tokenizer or feature extractor when available
* Adapt test

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Following the discussion of https://github.com/huggingface/moon-landing/issues/3632, I want to maybe kick-start a discussion here, as this PR / issue has been hanging a bit in the air; I think it was mostly me that was blocking this PR and it might be time to unblock it.

Having revisited my points here, I guess my opinion changed a bit with regard to:

i) The no-magic philosophy applies a bit less to

ii) Still feel strongly about this, as it clearly breaks a pattern to me and is somewhat unexpected to someone that knows

iii) I do see a clearer use case now! So happy to scratch that.

iv) Think it's also not that big of a deal; models and processors can just be treated differently.

=> So overall, if @LysandreJik @thomwolf @sgugger you're more in favor of merging this PR, happy to be outvoted here :-) Don't feel that strongly about it anymore.
If we pick a different name for the auto class (not
Yes, also totally fine for me to go with
Glad to see we're aligned for a good coverage of processors! I personally like
What does this PR do?

This PR adds an `AutoProcessor` API, similar to `AutoTokenizer` and `AutoFeatureExtractor`.