[PretrainedFeatureExtractor] + Wav2Vec2FeatureExtractor, Wav2Vec2Processor, Wav2Vec2Tokenizer #10324

Merged

Changes from 4 commits (34 commits total)
afe2a24  push to show (patrickvonplaten, Feb 22, 2021)
f70b70e  small improvement (patrickvonplaten, Feb 22, 2021)
1b9152e  small improvement (patrickvonplaten, Feb 22, 2021)
d135f74  Update src/transformers/feature_extraction_utils.py (patrickvonplaten, Feb 22, 2021)
5246685  Update src/transformers/feature_extraction_utils.py (patrickvonplaten, Feb 22, 2021)
b6e3d68  implement base (patrickvonplaten, Feb 23, 2021)
b315373  add common tests (patrickvonplaten, Feb 23, 2021)
3302c28  make all tests pass for wav2vec2 (patrickvonplaten, Feb 23, 2021)
8b883fe  make padding work & add more tests (patrickvonplaten, Feb 24, 2021)
93962ca  finalize feature extractor utils (patrickvonplaten, Feb 24, 2021)
55d2705  add call method to feature extraction (patrickvonplaten, Feb 24, 2021)
b496346  finalize feature processor (patrickvonplaten, Feb 24, 2021)
5239bf7  finish tokenizer (patrickvonplaten, Feb 24, 2021)
c17859e  finish general processor design (patrickvonplaten, Feb 24, 2021)
f64f25c  finish tests (patrickvonplaten, Feb 24, 2021)
08e3458  typo (patrickvonplaten, Feb 24, 2021)
ed9543a  remove bogus file (patrickvonplaten, Feb 24, 2021)
dca668a  finish docstring (patrickvonplaten, Feb 24, 2021)
4c7c013  add docs (patrickvonplaten, Feb 24, 2021)
80edb8b  finish docs (patrickvonplaten, Feb 24, 2021)
7189c24  small fix (patrickvonplaten, Feb 24, 2021)
6652130  correct docs (patrickvonplaten, Feb 24, 2021)
960c27c  save intermediate (patrickvonplaten, Feb 25, 2021)
d389a9e  load changes (patrickvonplaten, Feb 25, 2021)
900fee6  apply changes (patrickvonplaten, Feb 25, 2021)
7482eee  apply changes to doc (patrickvonplaten, Feb 25, 2021)
08b3ac6  change tests (patrickvonplaten, Feb 25, 2021)
e2ae501  apply surajs recommend (patrickvonplaten, Feb 25, 2021)
bfddc7f  final changes (patrickvonplaten, Feb 25, 2021)
bad66a8  Merge branch 'master' into speech_processor_design (patrickvonplaten, Feb 25, 2021)
d8ef58f  Apply suggestions from code review (patrickvonplaten, Feb 25, 2021)
6b5cc29  fix typo (patrickvonplaten, Feb 25, 2021)
d1aa8ea  fix import (patrickvonplaten, Feb 25, 2021)
791fbee  correct docstring (patrickvonplaten, Feb 25, 2021)
65 changes: 65 additions & 0 deletions src/transformers/feature_extraction_utils.py
@@ -0,0 +1,65 @@
# coding=utf-8
# Copyright 2020 The HuggingFace Inc. team.
patrickvonplaten marked this conversation as resolved.
Show resolved Hide resolved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Feature extraction classes for python tokenizers.
"""

Collaborator:

All the objects up until here are common objects for our internal use. This is not a modeling file so I would expect all of those to be defined in one place and shared. This is especially important for objects like TensorType that are in the main init of transformers and should only be defined in one place.

Contributor Author:

Agree very much actually!

Contributor Author:

Will move them to file_utils.py -> I think this is the cleanest option! The other option would be to just import them from tokenization_utils_base.py, but I think it's fair to move them since they have become more generic than just tokenization. I don't think there is a break in backward compatibility.
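
For concreteness, a minimal sketch of what that consolidation would look like (module paths follow the comment above; treat this as an assumption about the final layout, not the merged code):

# before: each utils module carries its own copy of the shared enum
#   tokenization_utils_base.py:   class TensorType(ExplicitEnum): ...
#   feature_extraction_utils.py:  class TensorType(ExplicitEnum): ...  <- duplicate

# after: defined once in file_utils.py and imported everywhere it is needed
from transformers.file_utils import TensorType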


class BatchFeature(UserDict):
Collaborator:

This class should have some "copied from BatchEncoding" statements (and I think there is a common parent class to write here).

Contributor Author:

I added some, but I think the only function that really shares more or less all the code is the to(...) function -> all other functions are quite different from each other (mainly because there is no _encodings attribute for BatchFeature), so I think it's ok to not have a common parent class for now?

""""""

def __init__(
self,
data: Optional[Dict[str, Any]] = None,
encoding: Optional[Union[EncodingFast, Sequence[EncodingFast]]] = None,
tensor_type: Union[None, str, TensorType] = None,
prepend_batch_axis: bool = False,
n_sequences: Optional[int] = None,
):
super().__init__(data)
# add similar functionality as BatchEncoding
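
To illustrate the one method that, per the comment above, shares "more or less all the code" with BatchEncoding, here is a minimal sketch of to(...) (assuming torch is available; the merged implementation may differ):

    def to(self, device) -> "BatchFeature":
        # Sketch: move every tensor value to `device`, mirroring BatchEncoding.to;
        # non-tensor values pass through unchanged.
        import torch  # a real implementation would guard this with is_torch_available()

        self.data = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in self.data.items()}
        return self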


class PreTrainedFeatureExtractor:
    """
    This is a general feature extraction class for speech recognition.
    """

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # IMPORTANT: feature extractors are always deterministic -> they are never trained
        # in any way like tokenizers are -> therefore all configuration params should be
        # stored in a json config
        self.sampling_rate = kwargs.get("sampling_rate", None)
        self.pad_vector = kwargs.get("pad_vector", None)
        self.feature_dim = kwargs.get("feature_dim", None)  # this will be 1 for Wav2Vec2, but 768 for Speech2TextTransformers

    def pad(self, feature: BatchFeature):
        """
        Implement general padding method
        """
        pass

    @classmethod
    def from_pretrained(cls, path):
        """
        General loading method
        """
        pass

    def save_pretrained(self, path):
        """
        General saving method
        """
        pass
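
Given the note above that all configuration params live in a json config, one way the two stubs could be filled in is a plain JSON round-trip over a feature_extractor_config.json file (the file name comes from the Wav2Vec2Processor comment below; Hub downloads and error handling are deliberately left out of this sketch):

    # Sketch only; assumes `import json, os` at the top of the module.
    @classmethod
    def from_pretrained(cls, path):
        # read all configuration params back from the json config
        with open(os.path.join(path, "feature_extractor_config.json"), encoding="utf-8") as f:
            return cls(**json.load(f))

    def save_pretrained(self, path):
        # write all configuration params to the json config
        os.makedirs(path, exist_ok=True)
        config = {"sampling_rate": self.sampling_rate, "pad_vector": self.pad_vector, "feature_dim": self.feature_dim}
        with open(os.path.join(path, "feature_extractor_config.json"), "w", encoding="utf-8") as f:
            json.dump(config, f, indent=2)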
86 changes: 86 additions & 0 deletions src/transformers/models/wav2vec2/feature_processing_wav2vec2.py
@@ -0,0 +1,86 @@
# coding=utf-8
# Copyright 2020 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Speech processor class for Wav2Vec2
"""


# NOTE inheritance from feature extractor
class Wav2Vec2FeatureExtractor(PreTrainedFeatureExtractor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def __call__(self, raw_speech):
        """
        Implement the call method
        """
        pass


# NOTE inheritance from tokenizer
class Wav2Vec2Tokenizer(PreTrainedTokenizer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def __call__(self, text):
        """
        Implement encoding functionality
        """
        pass

    def _decode(self, text):
        """
        Implement decoding functionality
        """
        pass


class Wav2Vec2Processor:
    def __init__(self, feature_extractor, tokenizer):
        self.feature_extractor = feature_extractor
        self.tokenizer = tokenizer
        self.current_processor = self.feature_extractor

    def save_pretrained(self, pretrained_model_name_or_path):
        self.feature_extractor.save_pretrained(pretrained_model_name_or_path)
        self.tokenizer.save_pretrained(pretrained_model_name_or_path)

    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):
        # will look for a `feature_extractor_config.json` file
        feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(pretrained_model_name_or_path)
        # will look for the tokenizer files
        tokenizer = Wav2Vec2Tokenizer.from_pretrained(pretrained_model_name_or_path)

        return cls(feature_extractor=feature_extractor, tokenizer=tokenizer)

    def __call__(self, *args, **kwargs):
        # delegates to the feature extractor by default, or to the tokenizer
        # while inside the `as_target_tokenizer` context manager
        return self.current_processor(*args, **kwargs)

    def batch_decode(self, *args, **kwargs):
        return self.tokenizer.batch_decode(*args, **kwargs)

    def decode(self, *args, **kwargs):
        return self.tokenizer.decode(*args, **kwargs)

    @contextmanager
    def as_target_tokenizer(self):
        """
        Temporarily sets the tokenizer for encoding the targets. Useful for tokenizers associated with
        sequence-to-sequence models that need a slightly different processing for the labels.
        """
        self.current_processor = self.tokenizer
        yield
        self.current_processor = self.feature_extractor
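
Putting the pieces together, a usage sketch of the processor design above (the checkpoint name and inputs are illustrative placeholders, not part of this PR):

# Illustrative only: checkpoint name and inputs are placeholder assumptions.
from transformers import Wav2Vec2Processor  # assumed export once this PR is merged

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")  # hypothetical checkpoint

raw_speech = [0.0] * 16_000  # one second of toy audio at 16 kHz
inputs = processor(raw_speech)  # __call__ routes to the feature extractor by default

with processor.as_target_tokenizer():
    labels = processor("HELLO WORLD")  # inside the context manager, __call__ routes to the tokenizer

text = processor.decode(labels["input_ids"])  # decoding always goes through the tokenizer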