
[Sequence Feature Extraction] Add truncation #12804

Conversation

@patrickvonplaten (Contributor) commented on Jul 20, 2021

What does this PR do?

This PR adds truncation to speech-related feature extractors. It should enable use cases such as: #12774

Unlike our tokenizers, we allow truncation to be only True or False => there is no "truncation strategy". The reason is that for feature extractors the input cannot be a "pair" of input sequences, so there is essentially just one truncation use case that applies to all inputs.

The logic is equivalent to the tokenizers' with a small exception in error handling. The differences are shown in examples 1 and 2 below:

  1. truncation=True, no padding, no max_length. IMO this should be an error because it's unclear what should be done here. The tokenizers don't throw an error and simply do nothing, but throwing an error is better here IMO:
```python
from transformers import Wav2Vec2FeatureExtractor, BatchFeature

feat_extractor = Wav2Vec2FeatureExtractor()
dummy_inputs = BatchFeature({"input_values": [[0.1, 0.2, 0.3], [0.1]]})

feat_extractor.pad(dummy_inputs, truncation=True)  # -> throws an error since `max_length` is not defined
```
  2. truncation=True, padding="longest", no max_length. IMO this should also be an error because it doesn't make sense both to truncate and to pad to the longest tensor in the batch: if we pad to the longest sequence, there is nothing to truncate. The tokenizers don't throw an error and simply do nothing, but throwing an error is better here IMO:
```python
from transformers import Wav2Vec2FeatureExtractor, BatchFeature

feat_extractor = Wav2Vec2FeatureExtractor()
dummy_inputs = BatchFeature({"input_values": [[0.1, 0.2, 0.3], [0.1]]})

feat_extractor.pad(dummy_inputs, truncation=True, padding="longest")  # -> throws an error since `max_length` is not defined and padding is not "max_length"
```
  3. truncation=True, padding="max_length". Here the logic is equivalent to the tokenizers':
```python
from transformers import Wav2Vec2FeatureExtractor, BatchFeature

feat_extractor = Wav2Vec2FeatureExtractor()
dummy_inputs = BatchFeature({"input_values": [[0.1, 0.2, 0.3], [0.1]]})

feat_extractor.pad(dummy_inputs, truncation=True, max_length=2, padding="max_length")  # -> output shape is [2, 2]
```
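The truncate-then-pad semantics of case 3 can be sketched in plain Python. This is only a minimal illustration of the behavior described above, not the actual transformers implementation, and `truncate_and_pad` is a hypothetical helper name:

```python
# Minimal sketch of truncation=True + padding="max_length": every sequence
# is first cut to `max_length`, then short sequences are right-padded
# with `padding_value`.
def truncate_and_pad(sequences, max_length, padding_value=0.0):
    padded = []
    for seq in sequences:
        seq = seq[:max_length]                                  # truncate long sequences
        seq = seq + [padding_value] * (max_length - len(seq))   # pad short sequences
        padded.append(seq)
    return padded

batch = truncate_and_pad([[0.1, 0.2, 0.3], [0.1]], max_length=2)
# batch == [[0.1, 0.2], [0.1, 0.0]], i.e. output shape [2, 2]
```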

In addition, pad_to_multiple_of works correctly and equivalently to the tokenizers. Tests are added to cover all possible use cases.
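For the pad_to_multiple_of interaction, the tokenizer-style rule is to round the target length up to the next multiple. A minimal sketch assuming that rounding rule (`effective_length` is a hypothetical helper, not part of the library):

```python
# Hypothetical helper illustrating how a `max_length` target is rounded
# up to the next multiple of `pad_to_multiple_of` before padding/truncation
# is applied.
def effective_length(max_length, pad_to_multiple_of=None):
    if pad_to_multiple_of is None or max_length % pad_to_multiple_of == 0:
        return max_length
    return ((max_length // pad_to_multiple_of) + 1) * pad_to_multiple_of

effective_length(10, pad_to_multiple_of=8)  # -> 16 (rounded up to a multiple of 8)
```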

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@patrickvonplaten patrickvonplaten changed the title [Sequence Feature Extraction] Add truncation [WIP}[Sequence Feature Extraction] Add truncation Jul 20, 2021
@patrickvonplaten patrickvonplaten changed the title [WIP}[Sequence Feature Extraction] Add truncation [WIP][Sequence Feature Extraction] Add truncation Jul 20, 2021
@patrickvonplaten patrickvonplaten changed the title [WIP][Sequence Feature Extraction] Add truncation [Sequence Feature Extraction] Add truncation Jul 20, 2021
@patrickvonplaten patrickvonplaten linked an issue Jul 20, 2021 that may be closed by this pull request
2 tasks
@sgugger (Collaborator) left a comment:


Thanks for adding this new feature! My main comment is on the default for truncation (why have None if it's just a True/False flag), the rest are just nits.

Review comments (all resolved): 3 on src/transformers/feature_extraction_sequence_utils.py, 4 on tests/test_sequence_feature_extraction_common.py
@LysandreJik (Member) left a comment:


LGTM! Great tests.

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@patrickvonplaten patrickvonplaten merged commit f6e2544 into huggingface:master Jul 23, 2021
@patrickvonplaten patrickvonplaten deleted the add_truncation_to_feature_extract branch July 23, 2021 15:53
Development

Successfully merging this pull request may close these issues.

max_length parameter in Wav2Vec2FeatureExtractor doesn't affect