
Batched extraction of speaker embeddings for Nemo word-level baseline #36

Merged: 1 commit merged into microsoft:main on Apr 16, 2024

Conversation

@Lakoc commented Apr 15, 2024

No description provided.

@Lakoc (Author) commented Apr 15, 2024

@microsoft-github-policy-service agree company="BUT"

@nidleo (Collaborator) left a comment

Thank you for contributing to this repo!

If you have the numbers, could you please mention the expected speed-up from this change in the PR description?

@Lakoc (Author) commented Apr 16, 2024

Sure, the previous version of the extractor runs in 1 minute 36 seconds on an NVIDIA RTX A5000 on recording MTG_30860_plaza_0, while the batched version runs in 31 s. It could be improved further by inferring multiple segments in a batch. However, I see that per-channel Whisper inference is now the primary bottleneck. I suggest using the HuggingFace implementation of Whisper long-form decoding.
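As context for "inferring multiple segments in a batch", here is a minimal sketch of batched extraction with a NeMo speaker-embedding model; the model name ("titanet_large"), the padding logic, and the helper function are illustrative assumptions, not the actual code in this PR:

```python
# Hypothetical sketch: batch variable-length speaker segments through a NeMo
# embedding model instead of one forward pass per segment.
import torch
from nemo.collections.asr.models import EncDecSpeakerLabelModel

model = EncDecSpeakerLabelModel.from_pretrained("titanet_large").eval().cuda()

def embed_segments(segments, batch_size=64):
    """segments: list of 1-D float tensors (16 kHz mono audio), variable length."""
    embeddings = []
    with torch.no_grad():
        for i in range(0, len(segments), batch_size):
            chunk = segments[i:i + batch_size]
            lengths = torch.tensor([s.numel() for s in chunk]).cuda()
            # Zero-pad every segment in the batch to the longest one.
            batch = torch.nn.utils.rnn.pad_sequence(chunk, batch_first=True).cuda()
            # EncDecSpeakerLabelModel.forward returns (logits, embeddings).
            _, embs = model.forward(input_signal=batch, input_signal_length=lengths)
            embeddings.append(embs.cpu())
    return torch.cat(embeddings)
```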

@nidleo merged commit 0993574 into microsoft:main on Apr 16, 2024. 1 check passed.
@nidleo (Collaborator) commented Apr 16, 2024

> I suggest using the HuggingFace implementation of Whisper long-form decoding.

If you know of an implementation as accurate as Whisper large-v3 and also faster (or at least one that supports batching independent streams), it would be great to know. My impression was that there's a speed/accuracy tradeoff.

@Lakoc (Author) commented Apr 18, 2024

HuggingFace Transformers previously used a chunked long-form algorithm. Recently, they have also introduced a sequential algorithm (huggingface/transformers#27658), similar to the one used in the original OpenAI repository, which supports batched inference, flash attention, and speculative decoding.
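For illustration, a hedged sketch of the HuggingFace long-form path through the ASR pipeline; the model id, batching across files, and generation settings are assumptions for the example, not configuration taken from this repository (recent transformers versions route audio longer than 30 s to the sequential long-form algorithm when no chunk_length_s is set):

```python
# Hypothetical usage sketch of HuggingFace Whisper long-form decoding.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",
)

# Each file is decoded with the sequential long-form algorithm; batch_size
# lets independent channels/files be processed together (an assumption that
# matches the batched-inference support mentioned above).
outputs = asr(
    ["channel0.wav", "channel1.wav"],
    batch_size=2,
    return_timestamps=True,
    generate_kwargs={"language": "en", "task": "transcribe"},
)
for out in outputs:
    print(out["text"])
    print(out["chunks"])  # segment-level timestamps
```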

After experimenting with it, I found the text outputs satisfactory. However, I noticed some noise in the timestamps. To address this, I integrated an external model for forced alignment. As a result, the baseline ASR now runs in approximately 2-3 minutes with large-v3, compared to 8 minutes with the OpenAI implementation on an RTX A5000. Although I initially considered creating a PR, I ultimately decided against it due to the reliance on an additional model for forced alignment.
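Purely as an illustration of the forced-alignment step (not necessarily the setup actually used above), a rough sketch with torchaudio's CTC forced-alignment API (torchaudio >= 2.1), assuming a wav2vec2 CTC model and a transcript already normalized to the model's uppercase labels with '|' as the word separator:

```python
# Hypothetical sketch: refine Whisper timestamps by force-aligning its
# transcript against an external CTC model's frame-level emissions.
import torch
import torchaudio
import torchaudio.functional as F

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model().eval()
labels = bundle.get_labels()                 # ('-', '|', 'E', 'T', ...); '-' is the blank
dictionary = {c: i for i, c in enumerate(labels)}

def align(waveform, transcript):
    """waveform: (1, num_samples) at bundle.sample_rate; transcript: e.g. 'HELLO|WORLD'."""
    with torch.no_grad():
        emissions, _ = model(waveform)       # (1, frames, num_labels)
        log_probs = torch.log_softmax(emissions, dim=-1)
    targets = torch.tensor([[dictionary[c] for c in transcript]], dtype=torch.int32)
    # Frame-level CTC alignment of the emissions to the given transcript.
    aligned_tokens, scores = F.forced_align(log_probs, targets, blank=0)
    # Collapse repeats and blanks into per-token spans (start/end frame indices).
    return F.merge_tokens(aligned_tokens[0], scores[0], blank=0)
```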

Additionally, there is a helpful blog post that compares various Whisper decoding implementations: https://amgadhasan.substack.com/p/sota-asr-tooling-long-form-transcription.
