Fixing bug in Voxtral when merging text and audio embeddings #40671
Conversation
```diff
  # Enable gradient tracking when inputs_embeds is a leaf tensor
  if inputs_embeds.is_leaf and inputs_embeds.requires_grad:
      inputs_embeds = inputs_embeds.clone()
  # replace text-audio token placeholders with audio embeddings
  audio_token_mask = input_ids == self.config.audio_token_id
- inputs_embeds[audio_token_mask] = audio_embeds
+ inputs_embeds[audio_token_mask] = audio_embeds.to(inputs_embeds.device)
```
Not a fan of using a clone tbh, can we use a masked scatter here instead, e.g. see
```python
inputs_embeds = inputs_embeds.masked_scatter(image_mask, image_embeds)
```
Should be an out-of-place op then, so I would hope it resolves the in-place operation issue
(probably still needs the device movement)
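For reference, a minimal sketch (toy shapes, not the Voxtral code) of why the in-place write trips up on a leaf tensor that requires grad, while the out-of-place `masked_scatter` does not:

```python
# Minimal sketch: in-place indexed assignment vs. out-of-place masked_scatter
import torch

inputs_embeds = torch.zeros(2, 4, 8, requires_grad=True)   # leaf tensor, as with a frozen embedding layer + LoRA
audio_embeds = torch.randn(3, 8)                            # one embedding per audio placeholder token
audio_token_mask = torch.tensor([[False, True, True, False],
                                 [True, False, False, False]])

try:
    inputs_embeds[audio_token_mask] = audio_embeds          # in-place write into the leaf tensor
except RuntimeError as err:
    print(err)  # a leaf Variable that requires grad is being used in an in-place operation

# masked_scatter returns a new tensor, so the leaf itself is never modified
merged = inputs_embeds.masked_scatter(audio_token_mask.unsqueeze(-1), audio_embeds)
print(merged.requires_grad, merged.shape)                   # True torch.Size([2, 4, 8])
```

`masked_scatter` fills the masked positions from the source tensor in order, so the flattened audio embeddings need to line up with the audio placeholder tokens.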
Agree here with @vasqu, let's not forget to also move the mask to the correct device and make sure it's broadcastable!
Can you please change to:
```python
# replace text-audio token placeholders with audio embeddings
audio_token_mask = (input_ids == self.config.audio_token_id).unsqueeze(-1)
inputs_embeds = inputs_embeds.masked_scatter(
    audio_token_mask.to(inputs_embeds.device), audio_embeds.to(inputs_embeds.device)
)
```
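For context, a small sketch (hypothetical shapes, not taken from the PR) of why the `unsqueeze(-1)` matters: `masked_scatter` expects the mask to be broadcastable to the shape of `inputs_embeds`, while `input_ids == audio_token_id` only has the `(batch, seq_len)` shape:

```python
# Hypothetical shapes; 99 stands in for config.audio_token_id
import torch

batch, seq_len, hidden = 2, 5, 16
input_ids = torch.tensor([[1, 99, 99, 2, 3], [99, 4, 5, 6, 7]])
inputs_embeds = torch.randn(batch, seq_len, hidden)
audio_embeds = torch.randn(3, hidden)                 # one row per audio placeholder token

audio_token_mask = (input_ids == 99).unsqueeze(-1)    # (2, 5) -> (2, 5, 1), broadcasts to (2, 5, 16)
inputs_embeds = inputs_embeds.masked_scatter(audio_token_mask, audio_embeds)
```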
[For maintainers] Suggested jobs to run (before merge): run-slow: voxtral
Adding a for-patch label here @eustlb? And glad to help ❤️
Actually no, the for-patch label is for what was broken in the last release, which is not the case here ;)
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
What does this PR do?
This PR addresses two issues arising from an in-place operation where audio and text embeddings are merged in the Voxtral model. This fixes Issue #40488 and an unreported issue resulting from using `device_map="auto"` with the Voxtral model.

More detail about the two issues and their resolution:

1. Issue #40488: the `forward` method of `VoxtralForConditionalGeneration` fails when using LoRA. The underlying issue is that when the text embedding layer is frozen, the `inputs_embeds` tensor extracted from the embedding layer is a leaf tensor. When `inputs_embeds` is a leaf tensor that requires gradients, its values cannot be reassigned in place. To address this, `inputs_embeds` will now be cloned to enable gradient tracking when it is a leaf tensor that requires gradients.
2. An unreported issue when using `device_map="auto"` with Voxtral. When using `device_map="auto"`, the audio and text layers might be distributed across different devices, so `inputs_embeds` and `audio_embeds` might be on different devices. In this case, attempting to assign the values of `audio_embeds` to `inputs_embeds` will fail since both tensors are expected to be on the same device. To address this, `audio_embeds` is moved to the same device as `inputs_embeds` before updating the values of `inputs_embeds` (a toy sketch of this follows below).
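To illustrate the second point, here is a toy sketch (assuming a CUDA device is available; illustrative shapes and placement, not the actual model code) of the device mismatch and the `.to()` fix:

```python
# Toy illustration of the device-mismatch failure and its fix
import torch

if torch.cuda.is_available():
    inputs_embeds = torch.zeros(2, 4, 8, device="cuda")   # e.g. text embedding layer placed on GPU by device_map="auto"
    audio_embeds = torch.randn(3, 8)                       # e.g. audio tower output left on another device (CPU here)
    audio_token_mask = torch.zeros(2, 4, dtype=torch.bool, device="cuda")
    audio_token_mask[0, :3] = True                         # three audio placeholder positions

    try:
        inputs_embeds[audio_token_mask] = audio_embeds     # fails: tensors live on different devices
    except RuntimeError as err:
        print(err)

    # fix: move audio_embeds to the device of inputs_embeds before merging
    inputs_embeds[audio_token_mask] = audio_embeds.to(inputs_embeds.device)
```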
Before submitting

- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case. -> Voxtral model fails with LoRA due to in-place operation error #40488
Who can review?
@eustlb