
[PretrainedFeatureExtractor] + Wav2Vec2FeatureExtractor, Wav2Vec2Processor, Wav2Vec2Tokenizer #10324

Merged

Conversation

@patrickvonplaten (Contributor) commented on Feb 22, 2021

🚨🚨🚨IMPORTANT Wav2Vec2 repositories that were added before 4.4 should make sure to manually add a feature extractor class.

This can be done as easily as doing:

# in a shell:
git clone <your/repo/>
cd <your/repo/>

# then, in Python:
from transformers import Wav2Vec2FeatureExtractor
feat_extract = Wav2Vec2FeatureExtractor()  # or feat_extract = Wav2Vec2FeatureExtractor(return_attention_mask=True) for lv60 models
feat_extract.save_pretrained("./")

# back in the shell:
git add . && git commit -m "add feature processor file" && git push

What does this PR do?

This PR proposes a new design for how to handle the feature extraction + tokenization functionality for speech models in a single class.
Speech models connect two different modalities: speech and text. In order to have more flexibility when extending Transformers to speech tasks, such as ASR, I propose a composite Processor class that has both a tokenizer and a feature_extractor attribute, similar to how composite tokenizers are currently handled for models such as RAG (see RagTokenizer).

For ASR models, the output of the model is text, so a tokenizer is required, and the input is a sequence of feature vectors (which includes raw waveform features), so a feature_extractor is required.
The tokenizer hereby has the exact same format as our current tokenizer implementations (e.g. Speech2TextTransformer models train their tokenizers the same way NLP models do, see section 4.1 here). Feature extractors, on the other hand, are of a completely new format and therefore deserve a PreTrainedFeatureExtractor class that mostly handles the loading & saving for all feature extractors and, in addition, provides padding functionality. Since feature extractors are deterministic by nature (feature extractors are not trained, the way tokenizers can be), we only need a single feature_extractor_config.json file to load and save the class IMO.
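To make the single-config-file idea concrete, here is a minimal save/load round trip one could expect (the constructor arguments are illustrative defaults, not mandated by this PR):

from transformers import Wav2Vec2FeatureExtractor

# deterministic preprocessing settings, all captured in one json config
feat_extract = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0)
feat_extract.save_pretrained("./my-wav2vec2")  # writes the single feature extractor config json
reloaded = Wav2Vec2FeatureExtractor.from_pretrained("./my-wav2vec2")  # fully reconstructs the extractor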
To meet the demands of a single model processing class that can handle both the text and speech modalities while being flexible enough for different kinds of speech models, I propose to add a composite SpeechProcessor class for each speech model that has both a tokenizer and a feature_extractor attribute. In short, for Wav2Vec2 it would look as follows:

class Wav2Vec2Processor:

    def __init__(self, feature_extractor: Wav2Vec2FeatureExtractor, tokenizer: Wav2Vec2CTCTokenizer):
        self.feature_extractor = feature_extractor
        self.tokenizer = tokenizer


class Wav2Vec2CTCTokenizer(PreTrainedTokenizer):

    ...


class Wav2Vec2FeatureExtractor(PreTrainedFeatureExtractor):

    ...
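As a rough sketch of the intended division of labor (the method bodies below are my assumptions about the proposal, not the merged implementation), the composite class would simply delegate:

class Wav2Vec2Processor:
    def __init__(self, feature_extractor, tokenizer):
        self.feature_extractor = feature_extractor
        self.tokenizer = tokenizer

    def __call__(self, raw_speech, **kwargs):
        # preprocessing of the speech input goes through the feature extractor ...
        return self.feature_extractor(raw_speech, **kwargs)

    def batch_decode(self, token_ids, **kwargs):
        # ... while decoding the model's text output goes through the tokenizer
        return self.tokenizer.batch_decode(token_ids, **kwargs)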

So this means we leverage all the existing functionality of the tokenizers for the tokenizer part of the speech models and create a new PreTrainedFeatureExtractor to handle the general feature extraction functionality. The composite Wav2Vec2Processor is then very similar in style to RagTokenizer and would provide the following functionality to the user:

import torch

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

inputs = processor(raw_waveform, return_tensors="pt", padding="longest")
logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)

pred_transcription = processor.batch_decode(predicted_ids)

# The processor can then later also be used to encode & decode labels, *e.g.*
with processor.as_tokenizer():
    label_ids = processor(label_str)

A couple of advantages of this design:

  • It makes sense logically. When we add multi-modal models, it is quite natural for me to add composite ...Processor classes to the library as well.
  • It is general enough to handle a bunch of different use cases. E.g. Speech2TextTransformers will have more or less the same feature extractor for the different tasks it was trained on, but will have different tokenizers depending on whether the model was trained on Librispeech/Must-C or Covost (cc @patil-suraj). The current design handles this very nicely by simply swapping the tokenizer (see the sketch after this list).
  • We only need to create a PretrainedFeatureExtractor class; all of the speech models' tokenization functionality is handled by the already existing PreTrainedTokenizer class.
  • It's general enough to handle all speech models IMO.
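A hypothetical sketch of the second bullet, using the Wav2Vec2 classes from this PR as a stand-in (the vocab file names are made up; only the tokenizer changes between tasks):

from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor, Wav2Vec2Processor

feature_extractor = Wav2Vec2FeatureExtractor()  # shared across tasks

# hypothetical per-task vocabularies
librispeech_tok = Wav2Vec2CTCTokenizer("vocab_librispeech.json")
covost_tok = Wav2Vec2CTCTokenizer("vocab_covost.json")

librispeech_processor = Wav2Vec2Processor(feature_extractor, librispeech_tok)
covost_processor = Wav2Vec2Processor(feature_extractor, covost_tok)  # same extractor, different tokenizer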

Backwards compatibility & breaking changes

Wav2Vec2Tokenizer is deprecated and replaced by a better Wav2Vec2CTCTokenizer class that can actually inherit the full tokenizer test suite. Wav2Vec2Tokenizer can still be used but is no longer to be found in the docs. It was made sure that the tokenizer configs stay the same for backward compatibility, so that I only had to add files for the Wav2Vec2FeatureExtractor (see: https://huggingface.co/facebook/wav2vec2-base-960h/commit/dbdb8c54a01c6b0ca8ec79f811970214fb72cecc).

Essentially, one is advised to replace Wav2Vec2Tokenizer with Wav2Vec2Processor in all scripts from now on; the API of Wav2Vec2Processor is identical to the API of the old Wav2Vec2Tokenizer.
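A minimal migration sketch (raw_waveform stands in for a list of float samples):

from transformers import Wav2Vec2Processor

# before (deprecated):
# tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
# inputs = tokenizer(raw_waveform, return_tensors="pt", padding="longest")

# after -- a drop-in replacement with the identical call signature:
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
inputs = processor(raw_waveform, return_tensors="pt", padding="longest")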

The only big breaking change is that AutoTokenizer now loads Wav2Vec2CTCTokenizer instead of Wav2Vec2Tokenizer.

Review

@LysandreJik, @patil-suraj, @sgugger, @thomwolf - this PR is now ready for a complete review.
@patil-suraj, it would be very nice if you could do a very thorough review and make sure that this design is 100% compatible with the Speech2TextTransformersProcessor that we'll add soon.

@LysandreJik (Member) left a comment:


This approach looks great and doesn't seem limiting at all. Implementing it for Wav2Vec2/SpeechToTextTransformer and refactoring/upstreaming methods down the road seems like a good implementation roadmap.

Regarding the implementation of FeatureProcessors, what do you have in mind regarding understandability/explicitness? Do you expect something like models, where we aim for maximum accessibility, with copy/pastes and single-file containers, or do you expect something like tokenizers, where some tokenizers inherit from others while modifying certain aspects, and some level of abstraction, making them harder to decipher?

I'm asking because I think it's relevant to the different preprocessing that can be handled by the feature processors. For example, normalizing or converting to MFCCs seems like it would be something quite widespread among speech-based feature processors; do we want to have that in each implementation (abstraction-free), or will the goal be to upstream these methods into the parent class once we identify similarities among feature processors?

@patrickvonplaten (Contributor, Author) replied:

> This approach looks great and doesn't seem limiting at all. Implementing it for Wav2Vec2/SpeechToTextTransformer and refactoring/upstreaming methods down the road seems like a good implementation roadmap.
>
> Regarding the implementation of FeatureProcessors, what do you have in mind regarding understandability/explicitness? Do you expect something like models, where we aim for maximum accessibility, with copy/pastes and single-file containers, or do you expect something like tokenizers, where some tokenizers inherit from others while modifying certain aspects, and some level of abstraction, making them harder to decipher?
>
> I'm asking because I think it's relevant to the different preprocessing that can be handled by the feature processors. For example, normalizing or converting to MFCCs seems like it would be something quite widespread among speech-based feature processors; do we want to have that in each implementation (abstraction-free), or will the goal be to upstream these methods into the parent class once we identify similarities among feature processors?

Yeah, good question! To be honest, I'm not really sure yet. I would like to enforce the rule that feature extractors can only inherit from PreTrainedFeatureExtractor and no other feature extractor. IMO, the best approach to begin with is to limit (as you've suggested) the user-facing API of the feature extractors to __call__, from_pretrained(), save_pretrained() and maybe something like from_file(), at least in the beginning.
I think a method like pad() is general enough to be implemented in PreTrainedFeatureExtractor from the start, because every extractor will need to do padding no matter what.

For pretty much all other methods (actually including normalization), I would copy-paste them into each feature extractor and make sure that they are private methods, _normalize(), so that we can later still do some refactoring here if needed.

So in general my strategy would be to have as little abstraction as possible, e.g. copy-paste helper classes such as those [snippet not shown] to each feature extractor, and then, when we have more models, maybe move some things upstream into the PretrainedFeatureExtractor file.
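A minimal sketch of that layering, assuming the PreTrainedFeatureExtractor base from this PR; the _normalize body is illustrative zero-mean/unit-variance logic, not the PR's exact code:

import numpy as np

from transformers.feature_extraction_utils import PreTrainedFeatureExtractor  # base class proposed in this PR

class Wav2Vec2FeatureExtractor(PreTrainedFeatureExtractor):  # inherits only from the base, never another extractor
    def _normalize(self, waveform):
        # private and copy-pasted per model -> free to refactor upstream later
        return (waveform - waveform.mean()) / np.sqrt(waveform.var() + 1e-5)

    def __call__(self, raw_speech, **kwargs):
        features = self._normalize(np.asarray(raw_speech, dtype=np.float32))
        return self.pad({"input_values": [features]}, **kwargs)  # pad() lives in the base class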

@LysandreJik (Member) replied:

Thanks for explaining, sounds like a good approach to me! Thanks for drafting the proposal.

@sgugger (Collaborator) left a comment:


I like this design a lot! I don't think we need to decide right now for normalize/from_file methods and we can always refactor those down the line (it's not a breaking change on the user side anyway).

I also completely agree that the feature processors should be loaded from one json file only (as long as we make it general enough).

@patil-suraj mentioned this pull request on Feb 23, 2021
@patrickvonplaten changed the title from "[SpeechProcessor] Design proposal" to "[Speech MultiModalDesign] PretrainedFeatureExtractor, Wav2Vec2Processor, Wav2Vec2Tokenizer" on Feb 24, 2021
@sgugger (Collaborator) left a comment:


Very nice! I especially liked all the new tests!
One thing I am very worried about is public functions/classes being defined in two different files with the same name. This is a design that will create lots of problems in my opinion; in this case, these are common utils that should only be defined in a single module (a new one if necessary).

LONGEST = "longest"
MAX_LENGTH = "max_length"
DO_NOT_PAD = "do_not_pad"

@sgugger (Collaborator) commented on this diff:


All the objects up until here are common objects for our internal use. This is not a modeling file so I would expect all of those to be defined in one place and shared. This is especially important for objects like TensorType that are in the main init of transformers and should only be defined in one place.

@patrickvonplaten (Contributor, Author) replied:


Agree very much actually!

@patrickvonplaten (Contributor, Author) added:


Will move them to file_utils.py -> I think this is the cleanest option! The other option would be to just import them from tokenization_utils_base.py, but I think it's fair to move them since they have become more generic than just tokenization. I don't think there is a backward compatibility break.
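A sketch of what the shared definition in file_utils.py could look like (using a plain Enum here; the library's actual enum base class may differ):

# file_utils.py -- defined exactly once, imported by both
# tokenization_utils_base.py and feature_extraction_utils.py
from enum import Enum

class PaddingStrategy(Enum):
    LONGEST = "longest"
    MAX_LENGTH = "max_length"
    DO_NOT_PAD = "do_not_pad"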

DO_NOT_PAD = "do_not_pad"


class BatchFeature(UserDict):
@sgugger (Collaborator) commented on this diff:


This class should have some "copied from BatchEncoding" statements (and I think there is a common parent class to write here).

@patrickvonplaten (Contributor, Author) replied:


I added some, but I think the only function that really shares more or less all the code is the to(...) function -> all other functions are quite different from each other (mainly because there is no _encodings attribute for BatchFeature), so I think it's ok to not have a common parent class for now?
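For reference, a sketch of the one method that does share essentially all of its code with BatchEncoding (simplified; the real to() may handle more cases):

from collections import UserDict

import torch

class BatchFeature(UserDict):
    def to(self, device):
        # move every tensor in the batch to `device`, leave non-tensor values untouched
        self.data = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in self.data.items()}
        return self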

@LysandreJik (Member) left a comment:


Thank you for your work on the feature extractor/processor/tokenizer. I think the API you've designed looks great.

I've left a few comments, mostly nitpicks, as I'm overall very fond of the API proposed.

@patil-suraj (Contributor) left a comment:


The API looks really great! Thanks a lot, @patrickvonplaten.

I just left a couple of nits. Apart from the attention_mask being 1D, everything looks great.

@sgugger (Collaborator) left a comment:


Thanks for updating! Just have one more nit and it should be good to merge!

@patrickvonplaten patrickvonplaten merged commit cb38ffc into huggingface:master Feb 25, 2021
@patrickvonplaten patrickvonplaten deleted the speech_processor_design branch February 25, 2021 14:42