[voxtral-realtime] get more perfs! by eustlb · Pull Request #44142 · huggingface/transformers

eustlb · 2026-02-18T21:44:11Z

What does this PR do?

So @Deep-unlearning noticed, while benchmarking for the Open ASR leaderboard, that the current implementation is particularly slow. That makes sense, since we go through every layer of the encoder forward, and the streaming paradigm means we generate way more tokens (since they are time-aligned).

Profiling indeed shows this:

[image: voxtral]

[image: voxtral-realtime]

But we don't necessarily have to do it! Let's just run the encoder fully once at the beginning when we can.
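To make the idea concrete, here is a minimal toy sketch, not the actual transformers code. It assumes the full audio is available up front (my reading of "when we can") and uses a hypothetical per-frame `ToyEncoder` whose output for a frame does not depend on how the input is chunked. Every name in it (`ToyEncoder`, `decode_step`, `NUM_LAYERS`) is made up for illustration.

```python
import math
from typing import List, Tuple

NUM_LAYERS = 4  # hypothetical encoder depth, for illustration only


class ToyEncoder:
    """Stand-in encoder that counts how many layer forwards it runs."""

    def __init__(self) -> None:
        self.layer_calls = 0

    def __call__(self, frames: List[float]) -> List[float]:
        out = list(frames)
        for _ in range(NUM_LAYERS):
            self.layer_calls += 1
            # Per-frame transform, so chunked and full runs give identical values.
            out = [math.tanh(0.5 * x + 0.1) for x in out]
        return out


def decode_step(embedding: float) -> int:
    """Stand-in for one decoder step: maps an embedding to a token id."""
    return int(embedding * 1000) % 50


def generate_per_step(frames: List[float]) -> Tuple[List[int], int]:
    """Run the whole encoder stack again at every generation step."""
    enc = ToyEncoder()
    tokens = [decode_step(enc([frame])[0]) for frame in frames]
    return tokens, enc.layer_calls


def generate_precomputed(frames: List[float]) -> Tuple[List[int], int]:
    """Run the encoder once on the full input, then only index into the result."""
    enc = ToyEncoder()
    embeddings = enc(frames)
    tokens = [decode_step(e) for e in embeddings]
    return tokens, enc.layer_calls


frames = [0.01 * i for i in range(100)]
slow_tokens, slow_calls = generate_per_step(frames)
fast_tokens, fast_calls = generate_precomputed(frames)
assert slow_tokens == fast_tokens  # same tokens either way
print(slow_calls, fast_calls)  # 400 4
```

In this toy setup the tokens are identical, but the per-step path runs 400 layer forwards for 100 time-aligned steps while the precomputed path runs 4. The real model has a causal encoder with its own caching, so this only illustrates where the overhead comes from.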

eustlb · 2026-02-18T21:44:23Z

run-slow: voxtral-realtime

github-actions · 2026-02-18T21:46:36Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: voxtral_realtime

HuggingFaceDocBuilderDev · 2026-02-18T21:53:36Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Deep-unlearning · 2026-02-21T08:39:53Z

Benchmarks before:


Filtering models by id: mistralai/Voxtral-Mini-4B-Realtime-2602
********************************************************************************
Results per dataset:
********************************************************************************
mistralai/Voxtral-Mini-4B-Realtime-2602 | hf-audio-esb-datasets-test-only-sorted_ami_test: WER = 17.06 %, RTFx = 39.09
mistralai/Voxtral-Mini-4B-Realtime-2602 | hf-audio-esb-datasets-test-only-sorted_earnings22_test: WER = 11.85 %, RTFx = 43.52
mistralai/Voxtral-Mini-4B-Realtime-2602 | hf-audio-esb-datasets-test-only-sorted_gigaspeech_test: WER = 10.39 %, RTFx = 48.64
mistralai/Voxtral-Mini-4B-Realtime-2602 | hf-audio-esb-datasets-test-only-sorted_librispeech_test.clea: WER = 2.09 %, RTFx = 45.20
mistralai/Voxtral-Mini-4B-Realtime-2602 | hf-audio-esb-datasets-test-only-sorted_librispeech_test.other: WER = 5.49 %, RTFx = 44.99
mistralai/Voxtral-Mini-4B-Realtime-2602 | hf-audio-esb-datasets-test-only-sorted_spgispeech_test: WER = 2.42 %, RTFx = 49.94
mistralai/Voxtral-Mini-4B-Realtime-2602 | hf-audio-esb-datasets-test-only-sorted_tedlium_test: WER = 3.79 %, RTFx = 41.33
mistralai/Voxtral-Mini-4B-Realtime-2602 | hf-audio-esb-datasets-test-only-sorted_voxpopuli_test: WER = 8.24 %, RTFx = 39.35

********************************************************************************
Composite Results:
********************************************************************************
mistralai/Voxtral-Mini-4B-Realtime-2602: WER = 7.67 %
mistralai/Voxtral-Mini-4B-Realtime-2602: RTFx = 47.90
********************************************************************************

After:

Filtering models by id: mistralai/Voxtral-Mini-4B-Realtime-2602
********************************************************************************
Results per dataset:
********************************************************************************
mistralai/Voxtral-Mini-4B-Realtime-2602 | hf-audio-esb-datasets-test-only-sorted_ami_test: WER = 17.07 %, RTFx = 72.11
mistralai/Voxtral-Mini-4B-Realtime-2602 | hf-audio-esb-datasets-test-only-sorted_earnings22_test: WER = 11.84 %, RTFx = 86.55
mistralai/Voxtral-Mini-4B-Realtime-2602 | hf-audio-esb-datasets-test-only-sorted_gigaspeech_test: WER = 10.38 %, RTFx = 90.34
mistralai/Voxtral-Mini-4B-Realtime-2602 | hf-audio-esb-datasets-test-only-sorted_librispeech_test.clea: WER = 2.08 %, RTFx = 85.78
mistralai/Voxtral-Mini-4B-Realtime-2602 | hf-audio-esb-datasets-test-only-sorted_librispeech_test.other: WER = 5.52 %, RTFx = 83.86
mistralai/Voxtral-Mini-4B-Realtime-2602 | hf-audio-esb-datasets-test-only-sorted_spgispeech_test: WER = 2.42 %, RTFx = 99.57
mistralai/Voxtral-Mini-4B-Realtime-2602 | hf-audio-esb-datasets-test-only-sorted_tedlium_test: WER = 3.79 %, RTFx = 81.87
mistralai/Voxtral-Mini-4B-Realtime-2602 | hf-audio-esb-datasets-test-only-sorted_voxpopuli_test: WER = 8.34 %, RTFx = 81.16

********************************************************************************
Composite Results:
********************************************************************************
mistralai/Voxtral-Mini-4B-Realtime-2602: WER = 7.68 %
mistralai/Voxtral-Mini-4B-Realtime-2602: RTFx = 93.32
********************************************************************************
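For a quick read of the two logs, the snippet below computes the per-dataset and composite RTFx ratios (after / before). The numbers are copied from the logs above; the short dataset keys are just labels chosen here. RTFx is the inverse real-time factor (seconds of audio processed per second of compute), so higher is faster.

```python
# RTFx values copied from the benchmark logs above: (before, after).
rtfx = {
    "ami": (39.09, 72.11),
    "earnings22": (43.52, 86.55),
    "gigaspeech": (48.64, 90.34),
    "librispeech_clean": (45.20, 85.78),
    "librispeech_other": (44.99, 83.86),
    "spgispeech": (49.94, 99.57),
    "tedlium": (41.33, 81.87),
    "voxpopuli": (39.35, 81.16),
}
composite_rtfx = (47.90, 93.32)
composite_wer = (7.67, 7.68)  # percent, (before, after)

# Speedup per dataset is simply the ratio of RTFx values.
speedups = {name: after / before for name, (before, after) in rtfx.items()}
for name, ratio in speedups.items():
    print(f"{name:>18}: {ratio:.2f}x")

composite_speedup = composite_rtfx[1] / composite_rtfx[0]
wer_delta = composite_wer[1] - composite_wer[0]
print(f"{'composite':>18}: {composite_speedup:.2f}x, WER delta = {wer_delta:+.2f} pts")
```

Composite RTFx goes up by roughly 1.95x, while composite WER moves by 0.01 points.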

xenova · 2026-02-21T18:41:10Z

maybe also to check: when running the streaming sample code, I get this

Traceback (most recent call last):
  File ".../voxtral_realtime/run.py", line 19, in <module>
    xaudio = np.pad(audio, (0, processor.num_right_pad_tokens * processor.raw_audio_length_per_tok))
                               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
TypeError: unsupported operand type(s) for *: 'method' and 'int'

maybe mistral changed their API? (since it looks like we just return properties from self.mistral_common_audio_config)

Update: Indeed (mistralai/mistral-common@a54e57f)
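For anyone hitting the same traceback before a fix lands, here is a small, self-contained reproduction of the failure mode and one possible defensive workaround. The two config classes and the `resolve` helper are hypothetical stand-ins, not mistral-common or transformers code, and the values `17` and `1280` are arbitrary placeholders.

```python
class OldStyleConfig:
    """Hypothetical stand-in: the value is exposed as a property."""

    @property
    def num_right_pad_tokens(self) -> int:
        return 17  # arbitrary placeholder value


class NewStyleConfig:
    """Hypothetical stand-in: the same name is now a regular method."""

    def num_right_pad_tokens(self) -> int:
        return 17  # arbitrary placeholder value


def resolve(obj, name: str):
    """Return obj.<name>, calling it first if it turns out to be a method."""
    value = getattr(obj, name)
    return value() if callable(value) else value


raw_audio_length_per_tok = 1280  # arbitrary placeholder value

# Multiplying the bound method directly reproduces the error from the traceback.
try:
    NewStyleConfig().num_right_pad_tokens * raw_audio_length_per_tok
    raised = False
except TypeError:
    raised = True

# Going through resolve() works with either style.
old_pad = resolve(OldStyleConfig(), "num_right_pad_tokens") * raw_audio_length_per_tok
new_pad = resolve(NewStyleConfig(), "num_right_pad_tokens") * raw_audio_length_per_tok
assert raised and old_pad == new_pad
```

This is only a stopgap for user scripts; the proper fix would be on the library side.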

eustlb · 2026-02-23T17:25:28Z

Thanks for noticing @xenova 🙏! Opening another PR for that!

precompute by default

0d772ae

eustlb changed the title ~~[voxtral] get more perfs!~~ [voxtral-realtime] get more perfs! Feb 19, 2026

eustlb wants to merge 1 commit into main from voxtral-realtime-perfs
