
Conversation

@rcogill (Contributor) commented Sep 3, 2025

What does this PR do?

This PR addresses two issues arising from an in-place operation where audio and text embeddings are merged in the Voxtral model. This fixes Issue #40488 and an unreported issue resulting from using device_map="auto" with the Voxtral model.

More detail about the two issues and their resolution:

  • In the issue reported in #40488, the forward method of VoxtralForConditionalGeneration fails when using LoRA. The underlying cause is that when the text embedding layer is frozen, the inputs_embeds tensor extracted from the embedding layer is a leaf tensor. Autograd forbids in-place modification of a leaf tensor that requires gradients, so the indexed assignment of audio embeddings into inputs_embeds raises a runtime error. To address this, inputs_embeds is now cloned whenever it is a leaf tensor that requires gradients, and the audio embeddings are written into the clone (see the sketch after this list).
  • The second, previously unreported issue arises when using device_map="auto" with Voxtral. With device_map="auto", the audio and text layers may be distributed across different devices, so inputs_embeds and audio_embeds can end up on different devices. Indexed assignment from audio_embeds into inputs_embeds then fails, because PyTorch requires both tensors to be on the same device. To address this, audio_embeds is now moved to the device of inputs_embeds before inputs_embeds is updated.
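A minimal sketch reproducing the first failure and the clone-based fix. Shapes and tensor values here are illustrative, not the actual Voxtral forward pass:

import torch

# A frozen embedding output behaves like this: a leaf tensor that has been
# flagged to require gradients (e.g. via enable_input_require_grads for LoRA).
inputs_embeds = torch.randn(4, 8).requires_grad_(True)  # leaf tensor
audio_token_mask = torch.tensor([True, False, True, False])
audio_embeds = torch.randn(2, 8)

try:
    inputs_embeds[audio_token_mask] = audio_embeds  # in-place write on a leaf
except RuntimeError as err:
    print(err)  # "a leaf Variable that requires grad is being used in an in-place operation"

# The fix in this PR: clone first (the clone is not a leaf), then assign,
# moving audio_embeds to the right device for the device_map="auto" case.
if inputs_embeds.is_leaf and inputs_embeds.requires_grad:
    inputs_embeds = inputs_embeds.clone()
inputs_embeds[audio_token_mask] = audio_embeds.to(inputs_embeds.device)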

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case. -> Voxtral model fails with LoRA due to in-place operation error #40488
  • Did you make sure to update the documentation with your changes? -> No documentation changes necessary.
  • Did you write any new necessary tests? -> No new features added.

Who can review?

@eustlb

Comment on lines 245 to 250
+ # Enable gradient tracking when inputs_embeds is a leaf tensor
+ if inputs_embeds.is_leaf and inputs_embeds.requires_grad:
+     inputs_embeds = inputs_embeds.clone()
  # replace text-audio token placeholders with audio embeddings
  audio_token_mask = input_ids == self.config.audio_token_id
- inputs_embeds[audio_token_mask] = audio_embeds
+ inputs_embeds[audio_token_mask] = audio_embeds.to(inputs_embeds.device)
vasqu (Contributor) commented:

Not a fan of using a clone tbh, can we use a masked scatter here instead, e.g. see

inputs_embeds = inputs_embeds.masked_scatter(image_mask, image_embeds)
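For context, a quick sketch of why an out-of-place masked_scatter sidesteps the problem. Shapes are illustrative (the image_mask/image_embeds names above come from another model's merging code):

import torch

inputs_embeds = torch.randn(4, 8, requires_grad=True)          # leaf tensor, as in #40488
mask = torch.tensor([True, False, True, False]).unsqueeze(-1)  # broadcasts over the hidden dim
audio_embeds = torch.randn(2, 8)                               # consumed in order at True positions

# masked_scatter returns a new tensor and leaves inputs_embeds untouched,
# so autograd's ban on in-place writes to leaf tensors never applies.
merged = inputs_embeds.masked_scatter(mask, audio_embeds)
merged.sum().backward()
print(inputs_embeds.grad.shape)  # torch.Size([4, 8]); gradients still flow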

vasqu (Contributor) commented:

Should be an out-of-place op then, so I would hope it resolves the in-place operation issue

(probably still needs the device movement)

@eustlb (Contributor) left a comment:

Thanks @rcogill for working on this and @vasqu for the first review! Quick benchmarks show the non-in-place masked_scatter is faster anyway. Can you add the suggested change? Good to merge then.

Comment on lines 245 to 250
+ # Enable gradient tracking when inputs_embeds is a leaf tensor
+ if inputs_embeds.is_leaf and inputs_embeds.requires_grad:
+     inputs_embeds = inputs_embeds.clone()
  # replace text-audio token placeholders with audio embeddings
  audio_token_mask = input_ids == self.config.audio_token_id
- inputs_embeds[audio_token_mask] = audio_embeds
+ inputs_embeds[audio_token_mask] = audio_embeds.to(inputs_embeds.device)
eustlb (Contributor) commented:

Agree here with @vasqu; let's not forget to also move the mask to the correct device and make it broadcastable!
Can you please change to:

# replace text-audio token placeholders with audio embeddings
audio_token_mask = (input_ids == self.config.audio_token_id).unsqueeze(-1)
inputs_embeds = inputs_embeds.masked_scatter(audio_token_mask.to(inputs_embeds.device), audio_embeds.to(inputs_embeds.device))
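To illustrate why the unsqueeze(-1) and the .to(...) calls are needed, a small walk-through with made-up sizes and a placeholder audio token id (5 below stands in for config.audio_token_id):

import torch

batch, seq_len, hidden = 1, 4, 8
input_ids = torch.tensor([[5, 7, 5, 9]])            # two audio placeholders (id 5 is made up)
inputs_embeds = torch.randn(batch, seq_len, hidden)
audio_embeds = torch.randn(2, hidden)               # one embedding row per audio token

# (batch, seq_len) -> (batch, seq_len, 1): broadcastable against the
# (batch, seq_len, hidden) embeddings, so each True entry selects a full
# hidden vector instead of a single scalar.
audio_token_mask = (input_ids == 5).unsqueeze(-1)

# The .to(inputs_embeds.device) calls cover device_map="auto", where the
# audio tower and the text embeddings may live on different GPUs.
inputs_embeds = inputs_embeds.masked_scatter(
    audio_token_mask.to(inputs_embeds.device),
    audio_embeds.to(inputs_embeds.device),
)
print(inputs_embeds.shape)  # torch.Size([1, 4, 8])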

eustlb (Contributor) commented:

Pushed the change directly to shortcut the process, and merging! Thanks again, @vasqu and @rcogill

@eustlb eustlb enabled auto-merge (squash) September 4, 2025 14:37

github-actions bot commented Sep 4, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: voxtral

@vasqu (Contributor) commented Sep 4, 2025

Adding a for-patch label here, @eustlb? And glad to help ❤️

@eustlb added and then removed the for patch label (tags issues/PRs that should be included in the next patch) Sep 4, 2025
@eustlb (Contributor) commented Sep 4, 2025

Actually no, the for-patch label is for what was broken in the last release, which is not the case here ;)

@eustlb eustlb merged commit 4cbca0d into huggingface:main Sep 4, 2025
17 checks passed
@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@rcogill (Contributor, Author) commented Sep 4, 2025

@eustlb and @vasqu , thank you both!
