
Batched extraction of speaker embeddings for Nemo word-level baseline #36

Merged: 1 commit merged into microsoft:main on Apr 16, 2024

Conversation

@Lakoc commented Apr 15, 2024

No description provided.

@Lakoc (Author) commented Apr 15, 2024

@microsoft-github-policy-service agree company="BUT"

@nidleo (Collaborator) left a comment

Thank you for contributing to this repo!

If you have the numbers, could you please mention the expected speed-up from this change in the PR description?

@Lakoc (Author) commented Apr 16, 2024

Sure, the previous version of the extractor runs in 1 minute 36 seconds on an NVIDIA RTX A5000 on recording MTG_30860_plaza_0, while the batched version runs in 31 s. It could be improved further by inferring multiple segments in a batch. However, I see that per-channel Whisper inference is now the primary bottleneck. I suggest using the HuggingFace implementation of Whisper long-form decoding.
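As context for "inferring multiple segments in a batch", here is a minimal sketch of batched extraction with a NeMo speaker-embedding model; the model name ("titanet_large"), the padding logic, and the helper function are illustrative assumptions, not the actual code in this PR:

```python
# Hypothetical sketch: batch variable-length speaker segments through a NeMo
# embedding model instead of one forward pass per segment.
import torch
from nemo.collections.asr.models import EncDecSpeakerLabelModel

model = EncDecSpeakerLabelModel.from_pretrained("titanet_large").eval().cuda()

def embed_segments(segments, batch_size=64):
    """segments: list of 1-D float tensors (16 kHz mono audio), variable length."""
    embeddings = []
    with torch.no_grad():
        for i in range(0, len(segments), batch_size):
            chunk = segments[i:i + batch_size]
            lengths = torch.tensor([s.numel() for s in chunk]).cuda()
            # Zero-pad every segment in the batch to the longest one.
            batch = torch.nn.utils.rnn.pad_sequence(chunk, batch_first=True).cuda()
            # EncDecSpeakerLabelModel.forward returns (logits, embeddings).
            _, embs = model.forward(input_signal=batch, input_signal_length=lengths)
            embeddings.append(embs.cpu())
    return torch.cat(embeddings)
```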

@nidleo merged commit 0993574 into microsoft:main on Apr 16, 2024. 1 check passed.
@nidleo (Collaborator) commented Apr 16, 2024

> I suggest using the HuggingFace implementation of Whisper long-form decoding.

If you know of an implementation as accurate as Whisper large-v3 and also faster (or at least one that supports batching independent streams), it would be great to know. My impression was that there's a speed/accuracy tradeoff.

@Lakoc (Author) commented Apr 18, 2024

HuggingFace Transformers previously used a chunked long-form algorithm. Recently, they have also introduced a sequential algorithm (huggingface/transformers#27658), similar to the one used in the original OpenAI repository, which supports batched inference, flash attention, and speculative decoding.
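For illustration, a hedged sketch of the HuggingFace long-form path through the ASR pipeline; the model id, batching across files, and generation settings are assumptions for the example, not configuration taken from this repository (recent transformers versions route audio longer than 30 s to the sequential long-form algorithm when no chunk_length_s is set):

```python
# Hypothetical usage sketch of HuggingFace Whisper long-form decoding.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",
)

# Each file is decoded with the sequential long-form algorithm; batch_size
# lets independent channels/files be processed together (an assumption that
# matches the batched-inference support mentioned above).
outputs = asr(
    ["channel0.wav", "channel1.wav"],
    batch_size=2,
    return_timestamps=True,
    generate_kwargs={"language": "en", "task": "transcribe"},
)
for out in outputs:
    print(out["text"])
    print(out["chunks"])  # segment-level timestamps
```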

After experimenting with it, I found the text outputs satisfactory. However, I noticed some noise in the timestamps. To address this, I integrated an external model for forced alignment. As a result, the baseline ASR now runs in approximately 2-3 minutes with large-v3, compared to 8 minutes with the OpenAI implementation on an RTX A5000. Although I initially considered creating a PR, I ultimately decided against it due to the reliance on an additional model for forced alignment.
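Purely as an illustration of the forced-alignment step (not necessarily the setup actually used above), a rough sketch with torchaudio's CTC forced-alignment API (torchaudio >= 2.1), assuming a wav2vec2 CTC model and a transcript already normalized to the model's uppercase labels with '|' as the word separator:

```python
# Hypothetical sketch: refine Whisper timestamps by force-aligning its
# transcript against an external CTC model's frame-level emissions.
import torch
import torchaudio
import torchaudio.functional as F

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model().eval()
labels = bundle.get_labels()                 # ('-', '|', 'E', 'T', ...); '-' is the blank
dictionary = {c: i for i, c in enumerate(labels)}

def align(waveform, transcript):
    """waveform: (1, num_samples) at bundle.sample_rate; transcript: e.g. 'HELLO|WORLD'."""
    with torch.no_grad():
        emissions, _ = model(waveform)       # (1, frames, num_labels)
        log_probs = torch.log_softmax(emissions, dim=-1)
    targets = torch.tensor([[dictionary[c] for c in transcript]], dtype=torch.int32)
    # Frame-level CTC alignment of the emissions to the given transcript.
    aligned_tokens, scores = F.forced_align(log_probs, targets, blank=0)
    # Collapse repeats and blanks into per-token spans (start/end frame indices).
    return F.merge_tokens(aligned_tokens[0], scores[0], blank=0)
```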

Additionally, there is a helpful blog post that compares various Whisper decoding implementations: https://amgadhasan.substack.com/p/sota-asr-tooling-long-form-transcription.
