
[PretrainedFeatureExtractor] + Wav2Vec2FeatureExtractor, Wav2Vec2Processor, Wav2Vec2Tokenizer #10324

Merged

Conversation

@patrickvonplaten (Contributor) commented on Feb 22, 2021

🚨🚨🚨IMPORTANT Wav2Vec2 repositories that were added before 4.4 should make sure to manually add a feature extractor class.

This can be done as easily as doing:

# in a shell:
git clone <your/repo/>
cd <your/repo/>

# then, in Python:
from transformers import Wav2Vec2FeatureExtractor
feat_extract = Wav2Vec2FeatureExtractor()  # or feat_extract = Wav2Vec2FeatureExtractor(return_attention_mask=True) for lv60 models
feat_extract.save_pretrained("./")

# back in the shell:
git add . && git commit -m "add feature processor file" && git push

What does this PR do?

This PR proposes a new design for how to handle the feature extraction + tokenization functionality for speech models in a single class.
Speech models connect two different modalities: speech and text. In order to have more flexibility when extending Transformers to speech tasks, such as ASR, I propose a composite Processor class that has both a tokenizer and a feature_extractor attribute, similar to how composite tokenizers are currently handled for models such as RAG (see RagTokenizer).

For ASR models, the output of the model is text, so a tokenizer is required, and the input is a sequence of feature vectors (which includes raw waveform features), so a feature_extractor is required.
The tokenizer hereby has the exact same format as our current tokenizer implementations (e.g. Speech2TextTransformer models train their tokenizers the same way NLP models do, see section 4.1 here). Feature extractors, on the other hand, are of a completely new format and therefore deserve a PreTrainedFeatureExtractor class that mostly handles the loading & saving for all feature extractors and, in addition, provides padding functionality. Since feature extractors are deterministic by nature (feature extractors are not trained, the way tokenizers can be), we only need a single feature_extractor_config.json file to load and save the class IMO.
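To make the single-config-file idea concrete, here is a minimal save/load round trip one could expect (the constructor arguments are illustrative defaults, not mandated by this PR):

from transformers import Wav2Vec2FeatureExtractor

# deterministic preprocessing settings, all captured in one json config
feat_extract = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0)
feat_extract.save_pretrained("./my-wav2vec2")  # writes the single feature extractor config json
reloaded = Wav2Vec2FeatureExtractor.from_pretrained("./my-wav2vec2")  # fully reconstructs the extractor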
To meet the demands of a single model processing class that can handle both the text and speech modalities while being flexible enough for different kinds of speech models, I propose to add a composite SpeechProcessor class for each speech model that has both a tokenizer and a feature_extractor attribute. In short, for Wav2Vec2 it would look as follows:

class Wav2Vec2Processor:

    def __init__(self, feature_extractor: Wav2Vec2FeatureExtractor, tokenizer: Wav2Vec2CTCTokenizer):
        self.feature_extractor = feature_extractor
        self.tokenizer = tokenizer


class Wav2Vec2CTCTokenizer(PreTrainedTokenizer):

    ...


class Wav2Vec2FeatureExtractor(PreTrainedFeatureExtractor):

    ...
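As a rough sketch of the intended division of labor (the method bodies below are my assumptions about the proposal, not the merged implementation), the composite class would simply delegate:

class Wav2Vec2Processor:
    def __init__(self, feature_extractor, tokenizer):
        self.feature_extractor = feature_extractor
        self.tokenizer = tokenizer

    def __call__(self, raw_speech, **kwargs):
        # preprocessing of the speech input goes through the feature extractor ...
        return self.feature_extractor(raw_speech, **kwargs)

    def batch_decode(self, token_ids, **kwargs):
        # ... while decoding the model's text output goes through the tokenizer
        return self.tokenizer.batch_decode(token_ids, **kwargs)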

So this means we leverage all the existing functionality of the tokenizers for the tokenizer part of the speech models and create a new PreTrainedFeatureExtractor to handle the general feature extraction functionality. The composite Wav2Vec2Processor is then very similar in style to RagTokenizer and would provide the following functionality to the user:

import torch

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

inputs = processor(raw_waveform, return_tensors="pt", padding="longest")
logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)

pred_transcription = processor.batch_decode(predicted_ids)

# The processor can then later also be used to encode & decode labels, *e.g.*
with processor.as_tokenizer():
    label_ids = processor(label_str)

A couple of advantages of this design:

  • It makes sense logically. When we add multi-modal models, it is quite natural for me to add composite ...Processor classes to the library as well.
  • It is general enough to handle a bunch of different use cases. E.g. Speech2TextTransformers will have more or less the same feature extractor for the different tasks it was trained on, but will have different tokenizers depending on whether the model was trained on Librispeech/Must-C or Covost (cc @patil-suraj). The current design handles this very nicely by simply swapping the tokenizer (see the sketch after this list).
  • We only need to create a PretrainedFeatureExtractor class; all of the speech models' tokenization functionality is handled by the already existing PreTrainedTokenizer class.
  • It's general enough to handle all speech models IMO.
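A hypothetical sketch of the second bullet, using the Wav2Vec2 classes from this PR as a stand-in (the vocab file names are made up; only the tokenizer changes between tasks):

from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor, Wav2Vec2Processor

feature_extractor = Wav2Vec2FeatureExtractor()  # shared across tasks

# hypothetical per-task vocabularies
librispeech_tok = Wav2Vec2CTCTokenizer("vocab_librispeech.json")
covost_tok = Wav2Vec2CTCTokenizer("vocab_covost.json")

librispeech_processor = Wav2Vec2Processor(feature_extractor, librispeech_tok)
covost_processor = Wav2Vec2Processor(feature_extractor, covost_tok)  # same extractor, different tokenizer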

Backwards compatibility & breaking changes

Wav2Vec2Tokenizer is deprecated and replaced by a better Wav2Vec2CTCTokenizer class that can actually inherit the full tokenizer test suite. Wav2Vec2Tokenizer can still be used but is no longer to be found in the docs. It was made sure that the tokenizer configs stay the same for backward compatibility, so that I only had to add files for the Wav2Vec2FeatureExtractor (see: https://huggingface.co/facebook/wav2vec2-base-960h/commit/dbdb8c54a01c6b0ca8ec79f811970214fb72cecc).

Essentially, one is advised to replace Wav2Vec2Tokenizer with Wav2Vec2Processor in all scripts from now on; the API of Wav2Vec2Processor is identical to the API of the old Wav2Vec2Tokenizer.
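A minimal migration sketch (raw_waveform stands in for a list of float samples):

from transformers import Wav2Vec2Processor

# before (deprecated):
# tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
# inputs = tokenizer(raw_waveform, return_tensors="pt", padding="longest")

# after -- a drop-in replacement with the identical call signature:
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
inputs = processor(raw_waveform, return_tensors="pt", padding="longest")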

The only big breaking change is that AutoTokenizer now loads Wav2Vec2CTCTokenizer instead of Wav2Vec2Tokenizer.

Review

@LysandreJik, @patil-suraj, @sgugger, @thomwolf - this PR is now ready for a complete review.
@patil-suraj, it would be very nice if you could do a very thorough review and make sure that this design is 100% compatible with the Speech2TextTransformersProcessor that we'll add soon.

@LysandreJik (Member) left a comment:


This approach looks great and doesn't seem limiting at all. Implementing it for Wav2Vec2/SpeechToTextTransformer and refactoring/upstreaming methods down the road seems like a good implementation roadmap.

Regarding the implementation of FeatureProcessors, what do you have in mind regarding understandability/explicitness? Do you expect something like models, where we aim for maximum accessibility, with copy/pastes and single-file containers, or do you expect something like tokenizers, where some tokenizers inherit from others while modifying certain aspects, and some level of abstraction, making them harder to decipher?

I'm asking because I think it's relevant to the different preprocessing that can be handled by the feature processors. For example, normalizing or converting to MFCCs seems like it would be something quite widespread among speech-based feature processors; do we want to have that in each implementation (abstraction-free), or will the goal be to upstream these methods into the parent class once we identify similarities among feature processors?

@patrickvonplaten (Contributor, Author) replied:

> This approach looks great and doesn't seem limiting at all. Implementing it for Wav2Vec2/SpeechToTextTransformer and refactoring/upstreaming methods down the road seems like a good implementation roadmap.
>
> Regarding the implementation of FeatureProcessors, what do you have in mind regarding understandability/explicitness? Do you expect something like models, where we aim for maximum accessibility, with copy/pastes and single-file containers, or do you expect something like tokenizers, where some tokenizers inherit from others while modifying certain aspects, and some level of abstraction, making them harder to decipher?
>
> I'm asking because I think it's relevant to the different preprocessing that can be handled by the feature processors. For example, normalizing or converting to MFCCs seems like it would be something quite widespread among speech-based feature processors; do we want to have that in each implementation (abstraction-free), or will the goal be to upstream these methods into the parent class once we identify similarities among feature processors?

Yeah, good question! To be honest, I'm not really sure yet. I would like to enforce the rule that feature extractors can only inherit from PreTrainedFeatureExtractor and no other feature extractor. IMO, the best approach to begin with is to limit (as you've suggested) the user-facing API of the feature extractors to __call__, from_pretrained(), save_pretrained() and maybe something like from_file(), at least in the beginning.
I think a method like pad() is general enough to be implemented in PreTrainedFeatureExtractor from the start, because every extractor will need to do padding no matter what.

For pretty much all other methods (actually including normalization), I would copy-paste them into each feature extractor and make sure that they are private methods, _normalize(), so that we can later still do some refactoring here if needed.

So in general my strategy would be to have as little abstraction as possible, e.g. copy-paste helper classes such as those [snippet not shown] to each feature extractor, and then, when we have more models, maybe move some things upstream into the PretrainedFeatureExtractor file.
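A minimal sketch of that layering, assuming the PreTrainedFeatureExtractor base from this PR; the _normalize body is illustrative zero-mean/unit-variance logic, not the PR's exact code:

import numpy as np

from transformers.feature_extraction_utils import PreTrainedFeatureExtractor  # base class proposed in this PR

class Wav2Vec2FeatureExtractor(PreTrainedFeatureExtractor):  # inherits only from the base, never another extractor
    def _normalize(self, waveform):
        # private and copy-pasted per model -> free to refactor upstream later
        return (waveform - waveform.mean()) / np.sqrt(waveform.var() + 1e-5)

    def __call__(self, raw_speech, **kwargs):
        features = self._normalize(np.asarray(raw_speech, dtype=np.float32))
        return self.pad({"input_values": [features]}, **kwargs)  # pad() lives in the base class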

@LysandreJik (Member) replied:

Thanks for explaining, sounds like a good approach to me! Thanks for drafting the proposal.

@sgugger (Collaborator) left a comment:


I like this design a lot! I don't think we need to decide right now for normalize/from_file methods and we can always refactor those down the line (it's not a breaking change on the user side anyway).

I also completely agree that the feature processors should be loaded from one json file only (as long as we make it general enough).

@patil-suraj mentioned this pull request on Feb 23, 2021
@patrickvonplaten changed the title from "[SpeechProcessor] Design proposal" to "[Speech MultiModalDesign] PretrainedFeatureExtractor, Wav2Vec2Processor, Wav2Vec2Tokenizer" on Feb 24, 2021
@sgugger (Collaborator) left a comment:


Very nice! I especially liked all the new tests!
One thing I am very worried about is public functions/classes being defined in two different files with the same name. This is a design that will create lots of problems in my opinion; in this case, these are common utils that should only be defined in a single module (a new one if necessary).

LONGEST = "longest"
MAX_LENGTH = "max_length"
DO_NOT_PAD = "do_not_pad"

@sgugger (Collaborator) commented on this diff:


All the objects up until here are common objects for our internal use. This is not a modeling file so I would expect all of those to be defined in one place and shared. This is especially important for objects like TensorType that are in the main init of transformers and should only be defined in one place.

@patrickvonplaten (Contributor, Author) replied:


Agree very much actually!

@patrickvonplaten (Contributor, Author) added:


Will move them to file_utils.py -> I think this is the cleanest option! The other option would be to just import them from tokenization_utils_base.py, but I think it's fair to move them since they have become more generic than just tokenization. I don't think there is a backward compatibility break.
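A sketch of what the shared definition in file_utils.py could look like (using a plain Enum here; the library's actual enum base class may differ):

# file_utils.py -- defined exactly once, imported by both
# tokenization_utils_base.py and feature_extraction_utils.py
from enum import Enum

class PaddingStrategy(Enum):
    LONGEST = "longest"
    MAX_LENGTH = "max_length"
    DO_NOT_PAD = "do_not_pad"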

DO_NOT_PAD = "do_not_pad"


class BatchFeature(UserDict):
@sgugger (Collaborator) commented on this diff:


This class should have some "copied from BatchEncoding" statements (and I think there is a common parent class to write here).

@patrickvonplaten (Contributor, Author) replied:


I added some, but I think the only function that really shares more or less all the code is the to(...) function -> all other functions are quite different from each other (mainly because there is no _encodings attribute for BatchFeature), so I think it's ok to not have a common parent class for now?
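For reference, a sketch of the one method that does share essentially all of its code with BatchEncoding (simplified; the real to() may handle more cases):

from collections import UserDict

import torch

class BatchFeature(UserDict):
    def to(self, device):
        # move every tensor in the batch to `device`, leave non-tensor values untouched
        self.data = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in self.data.items()}
        return self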

@LysandreJik (Member) left a comment:


Thank you for your work on the feature extractor/processor/tokenizer. I think the API you've designed looks great.

I've left a few comments, mostly nitpicks, as I'm overall very fond of the API proposed.

@patil-suraj (Contributor) left a comment:


The API looks really great! Thanks a lot, @patrickvonplaten.

I just left a couple of nits. Apart from the attention_mask being 1D, everything looks great.

@sgugger (Collaborator) left a comment:


Thanks for updating! Just have one more nit and it should be good to merge!

@patrickvonplaten patrickvonplaten merged commit cb38ffc into huggingface:master Feb 25, 2021
@patrickvonplaten patrickvonplaten deleted the speech_processor_design branch February 25, 2021 14:42