
[Sequence Feature Extraction] Add truncation #12804

Conversation

@patrickvonplaten (Contributor) commented on Jul 20, 2021

What does this PR do?

This PR adds truncation to speech-related feature extractors. It should enable use cases such as: #12774

Unlike our tokenizers, we allow truncation to be only True or False => there is no "truncation strategy". The reason is that for feature extractors the input cannot be a "pair" of input sequences, so there is essentially just one truncation use case that applies to all inputs.

The logic is equivalent to the tokenizers' with a small exception in error handling. The differences are shown in examples 1 and 2 below:

  1. truncation=True, no padding, no max_length. IMO this should be an error because it's unclear what should be done here. The tokenizers don't throw an error and simply do nothing, but throwing an error is better here IMO:
```python
from transformers import Wav2Vec2FeatureExtractor, BatchFeature

feat_extractor = Wav2Vec2FeatureExtractor()
dummy_inputs = BatchFeature({"input_values": [[0.1, 0.2, 0.3], [0.1]]})

feat_extractor.pad(dummy_inputs, truncation=True)  # -> throws an error since `max_length` is not defined
```
  2. truncation=True, padding="longest", no max_length. IMO this should also be an error because it doesn't make sense both to truncate and to pad to the longest tensor in the batch: if we pad to the longest sequence, there is nothing to truncate. The tokenizers don't throw an error and simply do nothing, but throwing an error is better here IMO:
```python
from transformers import Wav2Vec2FeatureExtractor, BatchFeature

feat_extractor = Wav2Vec2FeatureExtractor()
dummy_inputs = BatchFeature({"input_values": [[0.1, 0.2, 0.3], [0.1]]})

feat_extractor.pad(dummy_inputs, truncation=True, padding="longest")  # -> throws an error since `max_length` is not defined and padding is not "max_length"
```
  3. truncation=True, padding="max_length". Here the logic is equivalent to the tokenizers':
```python
from transformers import Wav2Vec2FeatureExtractor, BatchFeature

feat_extractor = Wav2Vec2FeatureExtractor()
dummy_inputs = BatchFeature({"input_values": [[0.1, 0.2, 0.3], [0.1]]})

feat_extractor.pad(dummy_inputs, truncation=True, max_length=2, padding="max_length")  # -> output shape is [2, 2]
```
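The truncate-then-pad semantics of case 3 can be sketched in plain Python. This is only a minimal illustration of the behavior described above, not the actual transformers implementation, and `truncate_and_pad` is a hypothetical helper name:

```python
# Minimal sketch of truncation=True + padding="max_length": every sequence
# is first cut to `max_length`, then short sequences are right-padded
# with `padding_value`.
def truncate_and_pad(sequences, max_length, padding_value=0.0):
    padded = []
    for seq in sequences:
        seq = seq[:max_length]                                  # truncate long sequences
        seq = seq + [padding_value] * (max_length - len(seq))   # pad short sequences
        padded.append(seq)
    return padded

batch = truncate_and_pad([[0.1, 0.2, 0.3], [0.1]], max_length=2)
# batch == [[0.1, 0.2], [0.1, 0.0]], i.e. output shape [2, 2]
```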

In addition, pad_to_multiple_of works correctly and equivalently to the tokenizers. Tests are added to cover all possible use cases.
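For the pad_to_multiple_of interaction, the tokenizer-style rule is to round the target length up to the next multiple. A minimal sketch assuming that rounding rule (`effective_length` is a hypothetical helper, not part of the library):

```python
# Hypothetical helper illustrating how a `max_length` target is rounded
# up to the next multiple of `pad_to_multiple_of` before padding/truncation
# is applied.
def effective_length(max_length, pad_to_multiple_of=None):
    if pad_to_multiple_of is None or max_length % pad_to_multiple_of == 0:
        return max_length
    return ((max_length // pad_to_multiple_of) + 1) * pad_to_multiple_of

effective_length(10, pad_to_multiple_of=8)  # -> 16 (rounded up to a multiple of 8)
```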

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@patrickvonplaten patrickvonplaten changed the title [Sequence Feature Extraction] Add truncation [WIP}[Sequence Feature Extraction] Add truncation Jul 20, 2021
@patrickvonplaten patrickvonplaten changed the title [WIP}[Sequence Feature Extraction] Add truncation [WIP][Sequence Feature Extraction] Add truncation Jul 20, 2021
@patrickvonplaten patrickvonplaten changed the title [WIP][Sequence Feature Extraction] Add truncation [Sequence Feature Extraction] Add truncation Jul 20, 2021
@patrickvonplaten patrickvonplaten linked an issue Jul 20, 2021 that may be closed by this pull request
2 tasks
@sgugger (Collaborator) left a comment:


Thanks for adding this new feature! My main comment is on the default for truncation (why have None if it's just a True/False flag), the rest are just nits.

Review comments (all resolved): 3 on src/transformers/feature_extraction_sequence_utils.py, 4 on tests/test_sequence_feature_extraction_common.py
@LysandreJik (Member) left a comment:


LGTM! Great tests.

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@patrickvonplaten patrickvonplaten merged commit f6e2544 into huggingface:master Jul 23, 2021
@patrickvonplaten patrickvonplaten deleted the add_truncation_to_feature_extract branch July 23, 2021 15:53
Development

Successfully merging this pull request may close these issues.

max_length parameter in Wav2Vec2FeatureExtractor doesn't affect