Fix flaky test for multimodal LLMs #43944
Conversation
zucchini-nlp left a comment:
Approving so you can merge after checking that the test passes for other audio models :)
| if ("pixel" in key or key in ["image_patches", "input_feature"]) and key != model.main_input_name: | ||
| if ( | ||
| "pixel" in key | ||
| or key in ["image_patches", "input_feature", "input_features", "feature_attention_mask"] |
`input_features` is a quite common name across audio models like Whisper. Can you verify that the test passes for all models?

If it fails for other audio models, we could also override the test in the qwen3-omni-moe test file, along the lines of the sketch below.
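For reference, a minimal sketch of what such an override could look like; the class name comes from the failing test mentioned in this PR, but the body is hypothetical:

```python
import unittest


# Hypothetical override in tests/models/qwen3_omni_moe/test_modeling_qwen3_omni_moe.py:
# redefining the shared test in the model-specific class shadows the generic
# version inherited from the common tester mixin.
class Qwen3OmniMoeThinkerForConditionalGenerationModelTest(unittest.TestCase):
    def test_generate_continue_from_past_key_values(self):
        # Re-implement the generic test here, additionally dropping audio
        # inputs such as `input_features` and `feature_attention_mask`
        # before the continuation step.
        ...
```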
Yes! Dropping that input is intentional even when it's present on the model, unless it's the model's main input. Because we're passing `past_key_values` to generation, we don't actually want to run multimodal inputs through the encoder again; doing that is what caused the failure in the past. The filtering works roughly as in the sketch below.
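A minimal, self-contained sketch of that filtering; the helper name and dict handling are my own, only the key names come from the diff above:

```python
def drop_multimodal_inputs(inputs: dict, main_input_name: str) -> dict:
    """Hypothetical helper mirroring the test's input-dropping logic."""
    kept = {}
    for key, value in inputs.items():
        # Drop image *and* audio features unless they are the model's main
        # input: with past_key_values already supplied, re-running the
        # encoders on these inputs is what corrupted the continuation.
        if (
            "pixel" in key
            or key in ["image_patches", "input_feature", "input_features", "feature_attention_mask"]
        ) and key != main_input_name:
            continue
        kept[key] = value
    return kept
```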
Btw, the failing Bark test was fixed on
Force-pushed from 7c8bbb3 to 370b86b.
Ran the test for every model; this change didn't cause any new failures, but it did fix the Qwen3 test!
We have flaky test failures in `tests/models/qwen3_omni_moe/test_modeling_qwen3_omni_moe.py::Qwen3OmniMoeThinkerForConditionalGenerationModelTest::test_generate_continue_from_past_key_values`. The cause is that the logic in this test drops multimodal inputs (because it passes `past_key_values` to `generate` instead), but it fails to drop audio inputs correctly. This causes flaky failures in LLMs like Qwen3OmniMoE, which I think occur when the audio encoder creates audio features that overwrite the embeddings for generated tokens (maybe when the tokens have the `audio_token_id` value?)

I'm not 100% sure of the cause, but the test fails 3% of the time without this fix and 0% with it, so I'm pretty sure dropping the inputs correctly is the solution!
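For context, a minimal sketch (not the actual test code) of the generate-then-continue pattern the test exercises, assuming `model` is a loaded multimodal model and `inputs` its processed inputs:

```python
# First pass: generate a few tokens and keep the cache in the output.
first = model.generate(
    **inputs,
    max_new_tokens=3,
    use_cache=True,
    return_dict_in_generate=True,
)

# Continuation: feed the tokens generated so far plus the cached key/values.
# Multimodal inputs (pixel values, input_features, ...) must be dropped here,
# otherwise the vision/audio encoders run again and can overwrite the
# embeddings of already-generated tokens.
continued = model.generate(
    input_ids=first.sequences,
    past_key_values=first.past_key_values,
    max_new_tokens=3,
)
```

A real run would typically also extend `attention_mask` for the continuation step; that's omitted here for brevity.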
cc @zucchini-nlp