Token level timestamps for long-form generation in Whisper #29148

zucchini-nlp · 2024-02-20T17:32:31Z

What does this PR do?

Continuation of PR #28984. Adds token-level timestamps for long-form generation. The previous PR added timestamps in a quite different way, by calling extract_timestamps separately for each segment and each batch. I believe it can be done in one batch, with the result then divided into segments the same way the sequences are divided.

The final timestamps are already aligned with the total length, so there is no need to add start_time for each segment. I am not sure that is what we want, though, so I can remove this "total duration alignment" if needed.
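
As a rough illustration of the "one batch, then split" idea, here is a minimal plain-Python sketch. It is not the code in this PR: `split_into_segments`, the `boundaries` index pairs, and all values are hypothetical, and lists stand in for tensors. It assumes the timestamps are already on the full-audio timeline, so the slices need no per-segment start_time offset.

```python
from typing import Dict, List, Tuple


def split_into_segments(
    token_ids: List[int],
    token_timestamps: List[float],
    boundaries: List[Tuple[int, int]],
) -> List[Dict[str, list]]:
    """Slice tokens and their timestamps with the same (start, end) index pairs."""
    if len(token_ids) != len(token_timestamps):
        raise ValueError("expected exactly one timestamp per token")
    segments = []
    for start, end in boundaries:
        segments.append(
            {
                "tokens": token_ids[start:end],
                "token_timestamps": token_timestamps[start:end],
            }
        )
    return segments


# Hypothetical values: six tokens whose timestamps were computed in one pass
# and are already expressed on the full-audio timeline (assumption).
token_ids = [11, 12, 13, 21, 22, 23]
token_timestamps = [0.0, 0.4, 0.9, 30.0, 30.5, 31.2]
boundaries = [(0, 3), (3, 6)]

segments = split_into_segments(token_ids, token_timestamps, boundaries)
for segment in segments:
    print(segment["tokens"], segment["token_timestamps"])
```

Because the same index pairs slice both lists, each segment keeps one timestamp per token. The second segment starts at 30.0 rather than 0.0, which is the "total duration alignment" mentioned above.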

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@sanchit-gandhi
@patrickvonplaten
@gante ?

HuggingFaceDocBuilderDev · 2024-02-20T17:52:15Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

gante

In general it looks good to me, but I'd like to defer the approval to our Whisper expert @sanchit-gandhi :D

whisper.py

sanchit-gandhi

Thanks @zucchini-nlp - aligning the word-level timestamps at the per-segment level (rather than at the full-sequence level) sounds good to me. We'll probably need to propagate some changes forward to the ASR pipeline so that it's compatible with these generation-related changes, but we can do this in a follow-up PR.

sanchit-gandhi · 2024-02-23T13:37:39Z

tests/models/whisper/test_modeling_whisper.py

+        self.assertListEqual(tokens_shape, token_timestamps_shape)
+
+        # fmt: off
+        EXPECTED_OUTPUT = [


Are these values taken from the original repo? Or inspected from Transformers Whisper and deemed to be correct?

Inspected from Transformers Whisper and deemed to be correct. I am not sure how the tests for short-form timestamps were written, so I just took an audio sample and copied the model outputs.
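
For readers unfamiliar with this kind of regression check, the pattern can be sketched without torch. This is only an illustration under stated assumptions: `timestamps_close` is a hypothetical helper standing in for torch.allclose, the tolerance is arbitrary, and the segment values are made up rather than copied from the model.

```python
import math
from typing import Sequence


def timestamps_close(
    actual: Sequence[float], expected: Sequence[float], atol: float = 1e-4
) -> bool:
    """True when both sequences have equal length and agree element-wise within atol."""
    if len(actual) != len(expected):
        return False
    return all(
        math.isclose(a, e, rel_tol=0.0, abs_tol=atol)
        for a, e in zip(actual, expected)
    )


# Hypothetical data standing in for generate_outputs["segments"][0]
# and for the EXPECTED_OUTPUT list in the test.
segments = [
    {"tokens": [1, 2, 3], "token_timestamps": [0.0, 0.42, 0.98]},
    {"tokens": [4, 5], "token_timestamps": [1.5, 2.04]},
]
expected_output = [[0.0, 0.42, 0.98], [1.5, 2.04]]

# zip() stops at the shorter input, so compare the segment counts first.
assert len(segments) == len(expected_output)
for segment, exp_segment in zip(segments, expected_output):
    # One timestamp per token, mirroring the shape assertion in the test.
    assert len(segment["tokens"]) == len(segment["token_timestamps"])
    assert timestamps_close(segment["token_timestamps"], exp_segment)
```

Expected values copied from model outputs, as described above, guard against regressions but do not by themselves show agreement with the original Whisper implementation.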

sanchit-gandhi · 2024-02-23T13:44:52Z

Also cc @ArthurZucker for Whisper-related timestamps

ArthurZucker

Let's add a small end to end test with explicit outputs, will help us in the long run! Otherwise LGTM

ArthurZucker · 2024-02-27T00:40:54Z

tests/models/whisper/test_modeling_whisper.py

+
+        for segment, exp_segment in zip(generate_outputs["segments"][0], EXPECTED_OUTPUT):
+            self.assertTrue(torch.allclose(segment["token_timestamps"], exp_segment))
+


Could we add a small test like test_return_timestamps_in_preprocess in tests/pipelines/test_pipelines_automatic_speech_recognition.py, just to make sure we have something explicit like

'chunks': [
    {'text': ' Conquered', 'timestamp': (0.5, 1.2)},
    {'text': ' returned', 'timestamp': (1.2, 1.64)},
    {'text': ' to', 'timestamp': (1.64, 1.84)},
    {'text': ' its', 'timestamp': (1.84, 2.02)},
    {'text': ' place', 'timestamp': (2.02, 2.28)},
    {'text': ' amidst', 'timestamp': (2.28, 2.8)},
    {'text': ' the', 'timestamp': (2.8, 2.98)},
    {'text': ' tents.', 'timestamp': (2.98, 3.48)},
],

sure, done!
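
Beyond comparing against explicit values, an output of this shape can also be checked structurally. The sketch below uses the chunk list quoted in the review comment. `chunks_are_contiguous` is a hypothetical helper, not part of the pipeline or its tests, and it assumes word chunks sit back to back. That holds for this example but is not guaranteed in general (for instance across silences).

```python
from typing import Dict, List


def chunks_are_contiguous(chunks: List[Dict]) -> bool:
    """Check that each chunk has start <= end and begins where the previous one ended."""
    previous_end = None
    for chunk in chunks:
        start, end = chunk["timestamp"]
        if start > end:
            return False
        if previous_end is not None and start != previous_end:
            return False
        previous_end = end
    return True


# Chunk values taken from the example in the review comment.
chunks = [
    {"text": " Conquered", "timestamp": (0.5, 1.2)},
    {"text": " returned", "timestamp": (1.2, 1.64)},
    {"text": " to", "timestamp": (1.64, 1.84)},
    {"text": " its", "timestamp": (1.84, 2.02)},
    {"text": " place", "timestamp": (2.02, 2.28)},
    {"text": " amidst", "timestamp": (2.28, 2.8)},
    {"text": " the", "timestamp": (2.8, 2.98)},
    {"text": " tents.", "timestamp": (2.98, 3.48)},
]

print(chunks_are_contiguous(chunks))
print("".join(chunk["text"] for chunk in chunks).strip())
```

The exact `!=` comparison works here because adjacent chunks share the same literal boundary values. With computed timestamps, a tolerance-based comparison would be safer.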

zucchini-nlp · 2024-02-27T14:17:18Z

@amyeroberts ready for review here

Sorry, a mistag. Got a review from Arthur, so we merged it :)

zucchini-nlp added 4 commits February 20, 2024 14:02

long-sequence generation timestamps

7e5ed3c

Merge 'main' into whisper

8eee9dd

add test and fix

1bd1c06

remove breakpoint

12b76a2

codestyle

f3ad8e6

gante reviewed Feb 21, 2024

delete my files

a35900c

sanchit-gandhi approved these changes Feb 23, 2024

gante requested a review from ArthurZucker February 26, 2024 17:57

ArthurZucker approved these changes Feb 27, 2024

add tests for asr pipeline

1aa82ee

gante merged commit ddf7ac4 into huggingface:main Feb 27, 2024
18 checks passed
