
fix(asr): resolve tensor device mismatch in multi-GPU environments#260

Open
JasonOA888 wants to merge 2 commits into microsoft:main from JasonOA888:fix/issue-240-multi-gpu-device-mismatch

Conversation

@JasonOA888

Summary

Fixes #240

Problem

When using device_map=auto with multiple GPUs, inference fails with:

RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:6)

This occurs at modeling_vibevoice_asr.py:335:

combined_features = acoustic_features[speech_masks] + semantic_features[speech_masks]

Root Cause

accelerate's device_map=auto distributes model sublayers across GPUs. The acoustic_tokenizer may end up on cuda:0 while semantic_connector layers are on cuda:6. But speech_masks (created on CPU or cuda:0) is then used to index tensors that live on a different device.

Similarly, acoustic_input_mask indexes inputs_embeds without device alignment.

Solution

Move all mask/index tensors to the target device before indexing:

  1. speech_masks.to(acoustic_features.device) in encode_speech()
  2. acoustic_input_mask.to(inputs_embeds.device) in forward()
  3. speech_masks.to(audio_features.device) in modeling_vibevoice.py forward_speech_features()
  4. speech_semantic_tensors device alignment before passing to semantic_connector

Testing

  • Verified that all tensor indexing operations now use device-aligned masks
  • Compatible with single-GPU, multi-GPU, and CPU inference
  • No behavior change when devices already match (.to() is a no-op)

Files Changed

  • vibevoice/modular/modeling_vibevoice_asr.py - Fixed 2 device mismatch locations
  • vibevoice/modular/modeling_vibevoice.py - Fixed 2 device mismatch locations

When using device_map=auto for multi-GPU inference, tensors from different model
components may reside on different devices. This causes:

1. speech_masks indexing acoustic/semantic features fails when masks
   are on a different device than the features
2. acoustic_input_mask indexing inputs_embeds fails when mask
   is on a different device than the embeddings

Root cause: accelerate's device_map=auto distributes model sublayers across
GPUs. The acoustic_tokenizer and semantic_tokenizer may end up on devices
different from the language_model layers. The speech_masks and
acoustic_input_mask (usually on CPU or cuda:0) are not moved to match.

Fix: Ensure all masks are moved to the same device as the tensors they
index before any indexing operation.

Fixes microsoft#240
When using device_map=auto, multi-GPU inference fails because
index/mask tensors reside on different devices than
the data they are indexing into.

Three locations fixed:
1. encode_speech(): speech_masks moved to features device before indexing
2. forward(): acoustic_input_mask moved to inputs_embeds device
3. encode_speech(): speech_semantic_tensors moved to connector device

4. speech_semantic_tensors: all mask/index tensors are now moved to the
target device before use, preventing 'indices should be either on cpu
or on the same device as the indexed tensor' errors.

Fixes microsoft#240


Development

Successfully merging this pull request may close these issues.

Tensor device mismatch error when using multi-GPU with device_map=auto
