[PretrainedFeatureExtractor] + Wav2Vec2FeatureExtractor, Wav2Vec2Processor, Wav2Vec2Tokenizer #10324
Conversation
This approach looks great and doesn't seem limiting at all. Implementing it for Wav2Vec2/SpeechToTextTransformer and refactoring/upstreaming methods down the road seems like a good implementation roadmap.
Regarding the implementation of `FeatureProcessors`, what do you have in mind regarding understandability/explicitness? Do you expect something like models, where we aim for maximum accessibility, with copy/pastes and single-file containers, or something like tokenizers, where some tokenizers inherit from others while modifying certain aspects, with some level of abstraction that makes them harder to decipher?
I'm asking because I think it's relevant to the different preprocessing that can be handled by the feature processors. For example, normalizing or converting to MFCCs seems like it would be quite widespread among speech-based feature processors; do we want to have that in each implementation (abstraction-free), or will the goal be to upstream these methods into the parent class once we identify similarities among feature processors?
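As a concrete illustration of the kind of widespread preprocessing mentioned here, zero-mean/unit-variance normalization of a raw waveform is only a few lines; whether each feature extractor carries its own copy or inherits it from a parent class is exactly the trade-off being discussed. A stdlib-only sketch (the function name and epsilon are illustrative, not from the PR):

```python
from math import sqrt

def zero_mean_unit_var(waveform):
    """Normalize a raw waveform to zero mean and unit variance.

    Illustrative sketch of the normalization step discussed above;
    the epsilon guards against division by zero on silent input.
    """
    n = len(waveform)
    mean = sum(waveform) / n
    var = sum((x - mean) ** 2 for x in waveform) / n
    return [(x - mean) / sqrt(var + 1e-7) for x in waveform]
```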
Yeah, good question! To be honest, I'm not really sure yet. I would like to enforce the rule that feature extractors can only inherit from `PreTrainedFeatureExtractor`. For pretty much all other methods, my general strategy would be to have as little abstraction as possible, e.g. copy-paste classes such as the one at line 239 of the `PreTrainedFeatureExtractor` file (commit 19c1457).
Thanks for explaining, sounds like a good approach to me! Thanks for drafting the proposal.
I like this design a lot! I don't think we need to decide right now for the `normalize`/`from_file` methods, and we can always refactor those down the line (it's not a breaking change on the user side anyway). I also completely agree that the feature processors should be loaded from one json file only (as long as we make it general enough).
Very nice! I especially liked all the new tests!
One thing I am very worried about is public functions/classes being defined in two different files with the same name. In my opinion this design will create lots of problems, and in this case it concerns common utils that should only be defined in a single module (a new one if necessary).
```python
LONGEST = "longest"
MAX_LENGTH = "max_length"
DO_NOT_PAD = "do_not_pad"
```
All the objects up until here are common objects for our internal use. This is not a modeling file, so I would expect all of those to be defined in one place and shared. This is especially important for objects like `TensorType` that are in the main init of transformers and should only be defined in one place.
Agree very much actually!
Will move them to `file_utils.py` -> I think this is the cleanest option! The other option would be to just import them from `tokenization_utils_base.py`, but I think it's fair to move them since they have become more generic than just tokenization. I don't think there is a break in backwards compatibility.
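For illustration, the kind of shared definition being moved might look like the sketch below, modeled on the padding-strategy values quoted earlier in this review (the `ExplicitEnum` helper is an assumption about the shape of the shared utils, not a quote from the PR):

```python
from enum import Enum

class ExplicitEnum(Enum):
    # Enum that raises a more explicit error for unknown values; the
    # helper name is illustrative of a shared-utils base class.
    @classmethod
    def _missing_(cls, value):
        raise ValueError(
            f"{value!r} is not a valid {cls.__name__}, "
            f"select one of {[m.value for m in cls]}"
        )

class PaddingStrategy(ExplicitEnum):
    # Values match the snippet quoted earlier in this review.
    LONGEST = "longest"
    MAX_LENGTH = "max_length"
    DO_NOT_PAD = "do_not_pad"
```

Defining these once in a shared module lets both the tokenizer and feature-extractor code import them without duplication.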
```python
DO_NOT_PAD = "do_not_pad"


class BatchFeature(UserDict):
```
This class should have some "copied from BatchEncoding" statements (and I think there is a common parent class to write here).
I added some, but I think the only function that really shares more or less all the code is the `to(...)` function -> all other functions are quite different from each other (mainly because there is no `_encodings` attribute for `BatchFeature`), so I think it's ok to not have a common parent class for now?
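To illustrate the point about `to(...)` being the main shared code path, here is a stripped-down sketch of a `BatchFeature`-style container (only the `UserDict` base, the missing `_encodings` attribute, and the `to` method reflect the discussion; the rest is illustrative, not the library implementation):

```python
from collections import UserDict

class BatchFeature(UserDict):
    # Minimal dict-like container for feature-extractor outputs.
    # Unlike BatchEncoding it has no _encodings attribute, which is
    # why most other methods differ between the two classes.

    def to(self, device):
        # Move every tensor-like value to `device`; this mirrors the
        # one method noted above as shared with BatchEncoding.
        self.data = {k: v.to(device) for k, v in self.data.items()}
        return self
```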
Thank you for your work on the feature extractor/processor/tokenizer. I think the API you've designed looks great.
I've left a few comments, mostly nitpicks, as I'm overall very fond of the API proposed.
The API looks really great! Thanks a lot, @patrickvonplaten.
I just left a couple of nits. Apart from the `attention_mask` being 1D, everything looks great.
Thanks for updating! Just have one more nit and it should be good to merge!
🚨🚨🚨 IMPORTANT: Wav2Vec2 repositories that were added before 4.4 should make sure to manually add a feature extractor class.
This can be done as easily as doing:
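A stdlib-only sketch of what this amounts to — writing a single `feature_extractor_config.json` into the model repo so the deterministic feature extractor can be reloaded from it (all key names and default values below are assumptions for illustration, not verbatim from the library):

```python
import json
import os
import tempfile

# Hypothetical defaults for illustration; in the library these would
# come from the feature extractor's save_pretrained method.
config = {
    "feature_size": 1,
    "sampling_rate": 16000,
    "padding_value": 0.0,
    "do_normalize": True,
}

repo_dir = tempfile.mkdtemp()
path = os.path.join(repo_dir, "feature_extractor_config.json")
with open(path, "w") as f:
    json.dump(config, f, indent=2)

# Loading is the mirror image: one json file fully determines the
# (deterministic, untrained) feature extractor.
with open(path) as f:
    loaded = json.load(f)
```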
What does this PR do?
This is a new design for how to handle the feature extraction + tokenization functionality for speech models in a single class.
Speech models connect two different modalities: speech and text. In order to have more flexibility when extending Transformers to speech tasks, such as ASR, I propose a composite `Processor` class that has both a `tokenizer` and a `feature_extractor` attribute, similar to how composite tokenizers are currently handled for models such as RAG.

For ASR models, the output of the model is text, so a `tokenizer` is required, and the input is a sequence of `feature_vectors` (which includes raw waveform features), so a `feature_extractor` is required.

The tokenizer is hereby of the exact same format as our current tokenizer implementations (e.g. Speech2TextTransformer models train their tokenizers the same way NLP models train their tokenizers, see section 4.1 here). Feature extractors, on the other hand, are of a completely new format and therefore deserve a `PreTrainedFeatureExtractor` class that mostly handles the loading & saving for all feature extractors and, in addition, provides padding functionality. Since feature extractors are deterministic by nature (feature extractors are not trained, the way tokenizers can be), we only need a single `feature_extractor_config.json` file to load and save the class IMO.

To meet the demands of a single model processing class that can handle both the text and speech modalities while being flexible enough for different kinds of speech models, I propose to add a composite `SpeechProcessor` class for each speech model that has both a `tokenizer` and a `feature_extractor` attribute; for Wav2Vec2 this composite class is `Wav2Vec2Processor`. So this means we leverage all the existing functionality of the tokenizers for the tokenizer part of the speech models and create a new `PreTrainedFeatureExtractor` to handle general feature extraction functionality. The composite `Wav2Vec2Processor` is then very similar in style to `RagTokenizer` and exposes both the tokenizer's and the feature extractor's functionality to the user.

A couple of advantages of this design:
- Composite `...Processor` classes can be added to the library for other speech models as well.
- `Speech2TextTransformers` will have more or less the same feature extractor for the different tasks it was trained on, but will have different tokenizers depending on whether the model was trained on Librispeech/Must-C or Covost (cc @patil-suraj). The current design can handle this very nicely by simply changing the tokenizer.
- Apart from the new `PretrainedFeatureExtractor` class, all the speech model's tokenization functionality is handled by the already existing `PreTrainedTokenizer` class.

Backwards breaking compatibility
- `Wav2Vec2Tokenizer` is deprecated and is replaced by a better `Wav2Vec2CTCTokenizer` class that can actually inherit the full tokenizer test suite. `Wav2Vec2Tokenizer` can still be used but is no longer found in the docs. It was made sure that the tokenizer configs stay the same for backwards compatibility, so that I only had to add files for the `Wav2Vec2FeatureProcessor` (see: https://huggingface.co/facebook/wav2vec2-base-960h/commit/dbdb8c54a01c6b0ca8ec79f811970214fb72cecc).
- Essentially, one is advised to replace `Wav2Vec2Tokenizer` with `Wav2Vec2Processor` in all scripts from now on, whereas the API of `Wav2Vec2Processor` is identical to the API of the old `Wav2Vec2Tokenizer`.
- The only big breaking change is that `AutoTokenizer` now loads `Wav2Vec2CTCTokenizer` instead of `Wav2Vec2Tokenizer`.
Review
@LysandreJik, @patil-suraj, @sgugger, @thomwolf - this PR is now ready for a complete review.
@patil-suraj, it would be very nice if you could do a very thorough review and make sure that this design is 100% compatible with the `Speech2TextTransformersProcessor` that we'll add soon.