Fix flaky test for multimodal LLMs #43944
Conversation
zucchini-nlp left a comment:
Approving so you can merge after checking that the test passes for other audio models :)
| if ("pixel" in key or key in ["image_patches", "input_feature"]) and key != model.main_input_name: | ||
| if ( | ||
| "pixel" in key | ||
| or key in ["image_patches", "input_feature", "input_features", "feature_attention_mask"] |
`input_features` is a quite common name across audio models like Whisper. Can you verify that the test passes for all models?

If it fails for other audio models, we could also override the test in the qwen3-omni-moe test file, along the lines of the sketch below.
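For reference, a minimal sketch of what such an override could look like; the class name comes from the failing test mentioned in this PR, but the body is hypothetical:

```python
import unittest


# Hypothetical override in tests/models/qwen3_omni_moe/test_modeling_qwen3_omni_moe.py:
# redefining the shared test in the model-specific class shadows the generic
# version inherited from the common tester mixin.
class Qwen3OmniMoeThinkerForConditionalGenerationModelTest(unittest.TestCase):
    def test_generate_continue_from_past_key_values(self):
        # Re-implement the generic test here, additionally dropping audio
        # inputs such as `input_features` and `feature_attention_mask`
        # before the continuation step.
        ...
```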
Yes! Dropping that input is intentional even when it's present on the model, unless it's the model's main input. Because we're passing `past_key_values` to generation, we don't actually want to run multimodal inputs through the encoder again; doing that is what caused the failure in the past. The filtering works roughly as in the sketch below.
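A minimal, self-contained sketch of that filtering; the helper name and dict handling are my own, only the key names come from the diff above:

```python
def drop_multimodal_inputs(inputs: dict, main_input_name: str) -> dict:
    """Hypothetical helper mirroring the test's input-dropping logic."""
    kept = {}
    for key, value in inputs.items():
        # Drop image *and* audio features unless they are the model's main
        # input: with past_key_values already supplied, re-running the
        # encoders on these inputs is what corrupted the continuation.
        if (
            "pixel" in key
            or key in ["image_patches", "input_feature", "input_features", "feature_attention_mask"]
        ) and key != main_input_name:
            continue
        kept[key] = value
    return kept
```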
Btw, the failing Bark test was fixed on
Force-pushed from 7c8bbb3 to 370b86b.
Ran the test for every model; this change didn't cause any new failures, but it did fix the Qwen3 test!
We have flaky test failures in `tests/models/qwen3_omni_moe/test_modeling_qwen3_omni_moe.py::Qwen3OmniMoeThinkerForConditionalGenerationModelTest::test_generate_continue_from_past_key_values`. The cause is that the logic in this test drops multimodal inputs (because it passes `past_key_values` to `generate` instead), but it fails to drop audio inputs correctly. This causes flaky failures in LLMs like Qwen3OmniMoE, which I think occur when the audio encoder creates audio features that overwrite the embeddings for generated tokens (maybe when the tokens have the `audio_token_id` value?)

I'm not 100% sure of the cause, but the test fails 3% of the time without this fix and 0% with it, so I'm pretty sure dropping the inputs correctly is the solution!
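For context, a minimal sketch (not the actual test code) of the generate-then-continue pattern the test exercises, assuming `model` is a loaded multimodal model and `inputs` its processed inputs:

```python
# First pass: generate a few tokens and keep the cache in the output.
first = model.generate(
    **inputs,
    max_new_tokens=3,
    use_cache=True,
    return_dict_in_generate=True,
)

# Continuation: feed the tokens generated so far plus the cached key/values.
# Multimodal inputs (pixel values, input_features, ...) must be dropped here,
# otherwise the vision/audio encoders run again and can overwrite the
# embeddings of already-generated tokens.
continued = model.generate(
    input_ids=first.sequences,
    past_key_values=first.past_key_values,
    max_new_tokens=3,
)
```

A real run would typically also extend `attention_mask` for the continuation step; that's omitted here for brevity.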
cc @zucchini-nlp