fix whisper for multi-channel data #1289

Merged
merged 3 commits into lhotse-speech:master on Mar 7, 2024

Conversation

yuekaizhang (Contributor)

No description provided.


    if len(audio.shape) == 2:
        audio = audio[0]
    assert (
pzelasko (Collaborator)

With these changes, if you pass a 2-channel recording, it will silently discard the second channel. I think it's better to raise an error instead (i.e., only allow a 1-D array, or a 2-D array whose first dim is 1).
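A minimal sketch of the stricter check being suggested here; the helper name and error message are illustrative, not the PR's actual code:

    import numpy as np

    def validate_mono(audio: np.ndarray) -> np.ndarray:
        # Accept a 1-D array, or a 2-D array whose first dim is 1;
        # reject true multi-channel input instead of silently
        # dropping channels.
        if audio.ndim == 2:
            if audio.shape[0] != 1:
                raise ValueError(
                    f"Expected mono audio, got {audio.shape[0]} channels; "
                    "downmix or select a channel first."
                )
            audio = audio[0]
        return audio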

yuekaizhang (Contributor, Author)

Dumb question: if I have a multi-channel recording with shape [num_array, audio_len], e.g. num_array = 8 for aishell4, how does the Fbank feature extractor process it? Am I supposed to average the original audio to get mono audio?

yuekaizhang (Contributor, Author)

The Fbank feature extractor seems to use torchaudio.compliance.kaldi.fbank, where the input has to be mono. However, I can't find the place where we convert the [8, audio_len] array into mono for aishell4.
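To illustrate the mono requirement: a hedged sketch that averages channels before calling torchaudio.compliance.kaldi.fbank (the shapes and parameter values are illustrative, not taken from lhotse's extractor):

    import torch
    import torchaudio.compliance.kaldi as kaldi

    # Stand-in for an 8-channel AISHELL-4 recording: [num_channels, audio_len]
    multi = torch.randn(8, 16000)

    # One way to get mono: average across channels, keeping a leading
    # channel dim of 1 since kaldi.fbank expects a (channel, time) tensor.
    mono = multi.mean(dim=0, keepdim=True)

    feats = kaldi.fbank(mono, num_mel_bins=80, sample_frequency=16000.0)
    print(feats.shape)  # (num_frames, 80)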

pzelasko (Collaborator), Feb 23, 2024

You have 2 options:

1. Downmix to mono, which you'd do as a separate explicit step, e.g. via to_mono:

       def to_mono(self, mono_downmix: bool = False) -> Union["DataCut", List["DataCut"]]:

   or via load_audio in lhotse/cut/mixed.py (lines 1020 to 1022 in 769c273):

       def load_audio(
           self, mixed: bool = True, mono_downmix: bool = False
       ) -> Optional[np.ndarray]:

   Another variant would be to apply beamforming or a similar technique instead of just downmixing. (A minimal sketch of the downmix option follows after this comment.)

2. Compute features for each channel separately (if that makes sense for what you're going to do later with this); I think the existing feature extractors don't support that, but we can extend them.

But specifically for microphone arrays, the best idea seems to be to apply beamforming. CHiME-6 used BeamformIt; maybe there are newer/better tools available today.
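A minimal sketch of the downmix option above, assuming a lhotse CutSet workflow; the manifest path is hypothetical, and the to_mono semantics are inferred only from the signature quoted in this comment:

    from lhotse import CutSet

    # Hypothetical manifest containing multi-channel (e.g. AISHELL-4) cuts.
    cuts = CutSet.from_file("aishell4_cuts.jsonl.gz")

    for cut in cuts:
        # Collapse the channels into a single mono cut before feature
        # extraction; mono_downmix=True presumably averages channels
        # (the Union return type above suggests False yields one cut
        # per channel instead).
        if hasattr(cut, "to_mono"):
            cut = cut.to_mono(mono_downmix=True)
        audio = cut.load_audio()  # expected shape: (1, num_samples)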

yuekaizhang (Contributor, Author)

Sorry for the late response, @pzelasko. I have added a warning to address your concern. Would you mind checking it? Thanks.

pzelasko (Collaborator), Mar 7, 2024

Thanks, LGTM!

pzelasko merged commit 7cc8fb4 into lhotse-speech:master on Mar 7, 2024
8 of 11 checks passed