Fix flaky test for multimodal LLMs #43944

Merged
Rocketknight1 merged 1 commit into main from fix_generate_continuation_test on Feb 12, 2026

Conversation

@Rocketknight1 (Member):

We have flaky test failures in tests/models/qwen3_omni_moe/test_modeling_qwen3_omni_moe.py::Qwen3OmniMoeThinkerForConditionalGenerationModelTest::test_generate_continue_from_past_key_values. The cause is that the test's logic drops multimodal inputs (because it passes past_key_values to generate instead), but it fails to drop the audio inputs correctly. This causes flaky failures in multimodal models like Qwen3OmniMoE, which I think occur when the audio encoder creates audio features that overwrite the embeddings of generated tokens (maybe when those tokens happen to have the audio_token_id value?).

I'm not 100% sure of the cause, but the test fails 3% of the time without this fix and 0% with it, so I'm pretty sure dropping inputs correctly is the solution!
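
To make the mechanics concrete, here is a minimal sketch of the continuation pattern the test exercises. This is illustrative only, not the actual test code: `continue_generation` is a made-up helper, `model` stands for any decoder-only multimodal transformers model, and `inputs` for its processor output (input_ids, attention_mask, pixel_values, input_features, and so on).

```python
import torch


def continue_generation(model, inputs, max_new_tokens=3):
    # First pass: full multimodal forward; keep the cache around.
    first = model.generate(
        **inputs, do_sample=False, max_new_tokens=max_new_tokens, return_dict_in_generate=True
    )

    # Second pass: continue from past_key_values. The encoder outputs for the
    # prompt are already baked into the cache, so every multimodal feature
    # tensor (except the model's main input) must be dropped before this call;
    # re-encoding the audio features is what made the Qwen3OmniMoE test flaky.
    multimodal_keys = ["image_patches", "input_feature", "input_features", "feature_attention_mask"]
    continuation = {
        key: value
        for key, value in inputs.items()
        if not ("pixel" in key or key in multimodal_keys) or key == model.main_input_name
    }
    continuation["input_ids"] = first.sequences  # prompt + tokens generated so far
    if "attention_mask" in continuation:  # extend the mask to cover the new tokens
        pad_len = first.sequences.shape[-1] - continuation["attention_mask"].shape[-1]
        continuation["attention_mask"] = torch.nn.functional.pad(
            continuation["attention_mask"], (0, pad_len), value=1
        )
    return model.generate(
        **continuation,
        past_key_values=first.past_key_values,
        do_sample=False,
        max_new_tokens=max_new_tokens,
    )
```

The dict comprehension mirrors the condition this PR changes: anything that looks like a pixel or audio feature tensor is dropped unless it happens to be the model's main input.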

cc @zucchini-nlp

Rocketknight1 marked this pull request as ready for review on February 12, 2026, 12:57
@zucchini-nlp (Member) left a comment:

Approving so you can merge after checking that the test passes for other audio models :)

if ("pixel" in key or key in ["image_patches", "input_feature"]) and key != model.main_input_name:
if (
"pixel" in key
or key in ["image_patches", "input_feature", "input_features", "feature_attention_mask"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

input_features is quite a common name across audio models like Whisper. Can you verify that the test passes for all models?
We could also override it in the qwen3-omni-moe test file if it fails for other audio models.

@Rocketknight1 (Member, Author) replied:

Yes! Dropping that input is intentional even when the model accepts it, unless it's the model's main_input_name. Because we're passing past_key_values to generate, we don't actually want to run the multimodal inputs through the encoder again, and doing that is what caused the failures here.
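
To sketch why re-encoding can matter: many multimodal models merge encoder outputs into the text embeddings by masking on a placeholder token id. This is a toy illustration of that suspected mechanism, not code from transformers; all names, ids, and shapes below are made up.

```python
import torch

# If feature tensors are sent again alongside past_key_values, the
# placeholder-id merge runs a second time on the new tokens.
audio_token_id = 7
input_ids = torch.tensor([[5, 7, 7, 9, 7]])   # suppose the final 7 is a *generated* token
inputs_embeds = torch.zeros(1, 5, 4)          # stand-in text embeddings
audio_features = torch.ones(3, 4)             # stand-in audio encoder output, one row per match

mask = input_ids == audio_token_id            # matches the generated token too
inputs_embeds[mask] = audio_features          # the generated token's embedding is clobbered
```

If that happens, the continuation no longer matches the uncached reference run, which would explain an intermittent mismatch rather than a hard failure.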

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp (Member):

Btw, the failing Bark test was fixed on main :)

Rocketknight1 force-pushed the fix_generate_continuation_test branch from 7c8bbb3 to 370b86b on February 12, 2026, 13:20
@Rocketknight1 (Member, Author):

Ran the test for every model, and this change didn't cause any new failures, but did fix the Qwen3 test!
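
For reference, one plausible way to run that sweep locally is pytest's `-k` name filter over the model test suites (the exact command used isn't shown here):

```bash
python -m pytest tests/models -k "test_generate_continue_from_past_key_values"
```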

Rocketknight1 enabled auto-merge (squash) on February 12, 2026, 13:28
Rocketknight1 merged commit 32def65 into main on February 12, 2026 (26 checks passed)
Rocketknight1 deleted the fix_generate_continuation_test branch on February 12, 2026, 13:30