[voxtral-realtime] get more perfs! #44142

Open
eustlb wants to merge 1 commit into main from voxtral-realtime-perfs

eustlb commented Feb 18, 2026

What does this PR do?

So @Deep-unlearning noticed, while benchmarking for the Open ASR Leaderboard, that the current implementation is particularly slow. That makes sense, since we go through every layer of the encoder forward, and the streaming paradigm means we generate way more tokens (since they are time-aligned).

Profiling indeed shows it:

voxtral
[Screenshot 2026-02-18 at 21 48 53]

voxtral-realtime
[Screenshot 2026-02-18 at 21 48 35]

But we don't necessarily have to do it! Let's just run the encoder fully once at the beginning when we can.
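
A toy sketch of the idea, not the actual diff: it assumes the encoder is causal, so the output at step `t` depends only on frames up to `t`. Under that assumption, encoding the full audio once and indexing into the result gives the same per-step outputs as re-running the encoder on a growing prefix. `toy_causal_encoder`, `decode_recompute` and `decode_precomputed` are hypothetical names for illustration only.

```python
from itertools import accumulate


def toy_causal_encoder(frames):
    """Stand-in for a causal encoder: output t is the running mean of frames[:t+1]."""
    sums = list(accumulate(frames))
    return [s / (i + 1) for i, s in enumerate(sums)]


def decode_recompute(frames):
    # Slow path: re-run the encoder over the growing prefix at every step.
    return [toy_causal_encoder(frames[: t + 1])[-1] for t in range(len(frames))]


def decode_precomputed(frames):
    # Fast path: the whole audio is available, so encode once and index per step.
    encoded = toy_causal_encoder(frames)
    return [encoded[t] for t in range(len(frames))]


frames = [0.5, 1.5, 2.0, 4.0]
assert decode_recompute(frames) == decode_precomputed(frames)
```

The slow path does O(n²) encoder work for n steps, the fast path O(n). This only applies when the full audio is known up front ("when we can"); true live streaming still has to encode incrementally.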


eustlb commented Feb 18, 2026

run-slow: voxtral-realtime

@github-actions

[For maintainers] Suggested jobs to run (before merge)

run-slow: voxtral_realtime

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

eustlb changed the title from "[voxtral] get more perfs!" to "[voxtral-realtime] get more perfs!" on Feb 19, 2026

@Deep-unlearning

Benchmarks before:


Filtering models by id: mistralai/Voxtral-Mini-4B-Realtime-2602
********************************************************************************
Results per dataset:
********************************************************************************
mistralai/Voxtral-Mini-4B-Realtime-2602 | hf-audio-esb-datasets-test-only-sorted_ami_test: WER = 17.06 %, RTFx = 39.09
mistralai/Voxtral-Mini-4B-Realtime-2602 | hf-audio-esb-datasets-test-only-sorted_earnings22_test: WER = 11.85 %, RTFx = 43.52
mistralai/Voxtral-Mini-4B-Realtime-2602 | hf-audio-esb-datasets-test-only-sorted_gigaspeech_test: WER = 10.39 %, RTFx = 48.64
mistralai/Voxtral-Mini-4B-Realtime-2602 | hf-audio-esb-datasets-test-only-sorted_librispeech_test.clea: WER = 2.09 %, RTFx = 45.20
mistralai/Voxtral-Mini-4B-Realtime-2602 | hf-audio-esb-datasets-test-only-sorted_librispeech_test.other: WER = 5.49 %, RTFx = 44.99
mistralai/Voxtral-Mini-4B-Realtime-2602 | hf-audio-esb-datasets-test-only-sorted_spgispeech_test: WER = 2.42 %, RTFx = 49.94
mistralai/Voxtral-Mini-4B-Realtime-2602 | hf-audio-esb-datasets-test-only-sorted_tedlium_test: WER = 3.79 %, RTFx = 41.33
mistralai/Voxtral-Mini-4B-Realtime-2602 | hf-audio-esb-datasets-test-only-sorted_voxpopuli_test: WER = 8.24 %, RTFx = 39.35

********************************************************************************
Composite Results:
********************************************************************************
mistralai/Voxtral-Mini-4B-Realtime-2602: WER = 7.67 %
mistralai/Voxtral-Mini-4B-Realtime-2602: RTFx = 47.90
********************************************************************************

After:

Filtering models by id: mistralai/Voxtral-Mini-4B-Realtime-2602
********************************************************************************
Results per dataset:
********************************************************************************
mistralai/Voxtral-Mini-4B-Realtime-2602 | hf-audio-esb-datasets-test-only-sorted_ami_test: WER = 17.07 %, RTFx = 72.11
mistralai/Voxtral-Mini-4B-Realtime-2602 | hf-audio-esb-datasets-test-only-sorted_earnings22_test: WER = 11.84 %, RTFx = 86.55
mistralai/Voxtral-Mini-4B-Realtime-2602 | hf-audio-esb-datasets-test-only-sorted_gigaspeech_test: WER = 10.38 %, RTFx = 90.34
mistralai/Voxtral-Mini-4B-Realtime-2602 | hf-audio-esb-datasets-test-only-sorted_librispeech_test.clea: WER = 2.08 %, RTFx = 85.78
mistralai/Voxtral-Mini-4B-Realtime-2602 | hf-audio-esb-datasets-test-only-sorted_librispeech_test.other: WER = 5.52 %, RTFx = 83.86
mistralai/Voxtral-Mini-4B-Realtime-2602 | hf-audio-esb-datasets-test-only-sorted_spgispeech_test: WER = 2.42 %, RTFx = 99.57
mistralai/Voxtral-Mini-4B-Realtime-2602 | hf-audio-esb-datasets-test-only-sorted_tedlium_test: WER = 3.79 %, RTFx = 81.87
mistralai/Voxtral-Mini-4B-Realtime-2602 | hf-audio-esb-datasets-test-only-sorted_voxpopuli_test: WER = 8.34 %, RTFx = 81.16

********************************************************************************
Composite Results:
********************************************************************************
mistralai/Voxtral-Mini-4B-Realtime-2602: WER = 7.68 %
mistralai/Voxtral-Mini-4B-Realtime-2602: RTFx = 93.32
********************************************************************************
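
For a quick read of the two composite lines above, here is a small sketch that computes the throughput ratio and the WER change. The numbers are copied from the composite results; the variable names are only illustrative. RTFx is the inverse real-time factor (seconds of audio processed per second of compute), so higher is faster.

```python
# Composite results copied from the benchmark output above.
before = {"wer": 7.67, "rtfx": 47.90}
after = {"wer": 7.68, "rtfx": 93.32}

speedup = after["rtfx"] / before["rtfx"]
wer_delta = after["wer"] - before["wer"]

print(f"speedup: {speedup:.2f}x, WER delta: {wer_delta:+.2f} points")
# prints "speedup: 1.95x, WER delta: +0.01 points"
```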


xenova commented Feb 21, 2026

maybe also to check: when running the streaming sample code, I get this

Traceback (most recent call last):
  File ".../voxtral_realtime/run.py", line 19, in <module>
    xaudio = np.pad(audio, (0, processor.num_right_pad_tokens * processor.raw_audio_length_per_tok))
                               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
TypeError: unsupported operand type(s) for *: 'method' and 'int'

maybe mistral changed their API? (since it looks like we just return properties from self.mistral_common_audio_config)

Update: Indeed (mistralai/mistral-common@a54e57f)
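
One possible stopgap for the sample script, sketched under assumptions: if an attribute may be a plain value in one mistral-common version and a method in another, resolve it before multiplying. `as_value`, the two config classes and the numbers are hypothetical placeholders, not the real processor or mistral-common objects; only the attribute names come from the traceback.

```python
def as_value(attr):
    """Return attr() if it is callable (method-style API), else attr itself."""
    return attr() if callable(attr) else attr


class PropertyStyleConfig:
    # Hypothetical stand-in: both attributes are plain values (placeholder numbers).
    num_right_pad_tokens = 3
    raw_audio_length_per_tok = 100


class MethodStyleConfig:
    # Hypothetical stand-in: num_right_pad_tokens is now a method.
    raw_audio_length_per_tok = 100

    def num_right_pad_tokens(self):
        return 3


def right_pad_samples(config):
    # Works with either style; a bare `*` on the method raises the TypeError above.
    return as_value(config.num_right_pad_tokens) * as_value(config.raw_audio_length_per_tok)


assert right_pad_samples(PropertyStyleConfig()) == right_pad_samples(MethodStyleConfig()) == 300
```

The result could then be passed as the right-pad width to `np.pad`, as in the sample script. The proper fix is still to align the processor with the new mistral-common API.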


eustlb commented Feb 23, 2026

Thanks for noticing @xenova 🙏! Opening another PR for that!
