
fix(asr): resolve tensor device mismatch in multi-GPU environments#260

Open
JasonOA888 wants to merge 2 commits into microsoft:main from JasonOA888:fix/issue-240-multi-gpu-device-mismatch

Conversation

@JasonOA888

Summary

Fixes #240

Problem

When using device_map=auto with multiple GPUs, inference fails with:

RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:6)

This occurs at modeling_vibevoice_asr.py:335:

combined_features = acoustic_features[speech_masks] + semantic_features[speech_masks]

Root Cause

accelerate's device_map=auto distributes model sublayers across GPUs. The acoustic_tokenizer may end up on cuda:0 while semantic_connector layers are on cuda:6. But speech_masks (created on CPU or cuda:0) is then used to index tensors that live on a different device.

Similarly, acoustic_input_mask indexes inputs_embeds without device alignment.

Solution

Move all mask/index tensors to the target device before indexing:

  1. speech_masks.to(acoustic_features.device) in encode_speech()
  2. acoustic_input_mask.to(inputs_embeds.device) in forward()
  3. speech_masks.to(audio_features.device) in modeling_vibevoice.py forward_speech_features()
  4. speech_semantic_tensors device alignment before passing to semantic_connector

Testing

  • Verified that all tensor indexing operations now use device-aligned masks
  • Compatible with single-GPU, multi-GPU, and CPU inference
  • No behavior change when devices already match (.to() is a no-op)

Files Changed

  • vibevoice/modular/modeling_vibevoice_asr.py - Fixed 2 device mismatch locations
  • vibevoice/modular/modeling_vibevoice.py - Fixed 2 device mismatch locations

When using device_map=auto for multi-GPU inference, tensors from different model
components may reside on different devices. This causes:

1. speech_masks indexing acoustic/semantic features fails when masks
   are on a different device than the features
2. acoustic_input_mask indexing inputs_embeds fails when mask
   is on a different device than the embeddings

Root cause: accelerate's device_map=auto distributes model sublayers across
GPUs. The acoustic_tokenizer and semantic_tokenizer may end up on devices
different from the language_model layers. The speech_masks and
acoustic_input_mask (usually on CPU or cuda:0) are not moved to match.

Fix: Ensure all masks are moved to the same device as the tensors they
index before any indexing operation.

Fixes microsoft#240
When using device_map=auto, multi-GPU inference fails because
index/mask tensors reside on different devices than
the data they are indexing into.

Three locations fixed:
1. encode_speech(): speech_masks moved to features device before indexing
2. forward(): acoustic_input_mask moved to inputs_embeds device
3. encode_speech(): speech_semantic_tensors moved to connector device

4. speech_semantic_tensors: all mask/index tensors are now moved to the
target device before use, preventing 'indices should be either on cpu
or on the same device as the indexed tensor' errors.

Fixes microsoft#240


Development

Successfully merging this pull request may close these issues.

Tensor device mismatch error when using multi-GPU with device_map=auto
