fix whisper for multi-channel data #1289
Conversation
```python
if len(audio.shape) == 2:
    audio = audio[0]
assert (
```
With these changes, if you pass a 2-channel recording, it will silently discard the second channel. I think it's better to raise an error instead (i.e. only allow a 1-D array, or a 2-D array where the first dim is 1).
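A minimal sketch of the check suggested above; the helper name `ensure_mono` is hypothetical, not part of the actual PR:

```python
import numpy as np

def ensure_mono(audio: np.ndarray) -> np.ndarray:
    """Accept a 1-D array, or a 2-D array with a single channel.
    Reject genuinely multi-channel input instead of silently
    discarding channels."""
    if audio.ndim == 1:
        return audio
    if audio.ndim == 2 and audio.shape[0] == 1:
        return audio[0]
    raise ValueError(
        f"Expected mono audio (1-D, or 2-D with first dim 1), "
        f"got shape {audio.shape}"
    )
```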
Dumb question: if I have a multi-channel recording with shape [num_array, audio_len], e.g. num_array = 8 for aishell4, how does the Fbank feature extractor process it? Am I supposed to average the original audio to get a mono signal?
The Fbank feature extractor seems to use torchaudio.compliance.kaldi.fbank, where the input has to be mono. However, I can't find the place where we convert the [8, audio_len] array into mono for aishell4.
You have 2 options:
- downmix to mono, which you'd do as a separate explicit step, e.g. `to_mono` (line 401 in 769c273):
  ```python
  def to_mono(self, mono_downmix: bool = False) -> Union["DataCut", List["DataCut"]]:
  ```
  or `load_audio` (lines 1020 to 1022 in 769c273):
  ```python
  def load_audio(
      self, mixed: bool = True, mono_downmix: bool = False
  ) -> Optional[np.ndarray]:
  ```
  another variant would be to apply beamforming or a similar technique instead of just downmixing
- compute features for each channel separately (if that makes sense for what you're going to do later with this); I think the existing feature extractors don't support that, but we can extend them.
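The two options above can be sketched in plain NumPy; the function names `downmix_to_mono` and `per_channel_features` are illustrative only (in Lhotse itself, downmixing goes through the `to_mono` / `load_audio(mono_downmix=True)` APIs quoted above):

```python
import numpy as np

def downmix_to_mono(audio: np.ndarray) -> np.ndarray:
    """Option 1: average a [num_channels, num_samples] array into mono."""
    assert audio.ndim == 2
    return audio.mean(axis=0)

def per_channel_features(audio: np.ndarray, extractor) -> np.ndarray:
    """Option 2: run a mono-only feature extractor on each channel
    separately and stack the results along a leading channel axis."""
    return np.stack([extractor(channel) for channel in audio])
```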
But specifically for microphone arrays, the best idea seems to be to apply beamforming. CHiME-6 used BeamformIt; maybe there are newer/better tools available today.
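For intuition, a heavily simplified delay-and-sum beamformer is sketched below. This is not what BeamformIt does internally (real tools estimate delays via cross-correlation and use weighted sums); the integer sample delays here are assumed to be given:

```python
import numpy as np

def delay_and_sum(audio: np.ndarray, delays) -> np.ndarray:
    """Toy delay-and-sum beamforming: advance each channel of a
    [num_channels, num_samples] array by its (given) integer sample
    delay, then average the aligned channels."""
    num_channels, _ = audio.shape
    out = np.zeros(audio.shape[1])
    for channel, delay in zip(audio, delays):
        out += np.roll(channel, -delay)  # align this channel in time
    return out / num_channels
```

With all delays set to zero this reduces to plain mono downmixing, which is why beamforming (with well-estimated delays) can only improve on a naive average.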
Sorry for the late response, @pzelasko. I have added a warning to address your concern. Would you mind checking it? Thanks.
Thanks, LGTM!