fix whisper for multi-channel data #1289

Merged
merged 3 commits into lhotse-speech:master on Mar 7, 2024

Conversation

yuekaizhang (Contributor)

No description provided.


    if len(audio.shape) == 2:
        audio = audio[0]
    assert (
pzelasko (Collaborator)

With these changes, if you pass a 2-channel recording, it will silently discard the second channel. I think it's better to raise an error instead (i.e., only allow a 1-D array, or a 2-D array whose first dim is 1).
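A minimal sketch of the stricter check being suggested here; the helper name and error message are illustrative, not the PR's actual code:

    import numpy as np

    def validate_mono(audio: np.ndarray) -> np.ndarray:
        # Accept a 1-D array, or a 2-D array whose first dim is 1;
        # reject true multi-channel input instead of silently
        # dropping channels.
        if audio.ndim == 2:
            if audio.shape[0] != 1:
                raise ValueError(
                    f"Expected mono audio, got {audio.shape[0]} channels; "
                    "downmix or select a channel first."
                )
            audio = audio[0]
        return audio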

yuekaizhang (Contributor, Author)

Dumb question: if I have a multi-channel recording with shape [num_array, audio_len], e.g. num_array = 8 for aishell4, how does the Fbank feature extractor process it? Am I supposed to average the original audio to get mono audio?

yuekaizhang (Contributor, Author)

The Fbank feature extractor seems to use torchaudio.compliance.kaldi.fbank, where the input has to be mono. However, I can't find the place where we convert the [8, audio_len] array into mono for aishell4.
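To illustrate the mono requirement: a hedged sketch that averages channels before calling torchaudio.compliance.kaldi.fbank (the shapes and parameter values are illustrative, not taken from lhotse's extractor):

    import torch
    import torchaudio.compliance.kaldi as kaldi

    # Stand-in for an 8-channel AISHELL-4 recording: [num_channels, audio_len]
    multi = torch.randn(8, 16000)

    # One way to get mono: average across channels, keeping a leading
    # channel dim of 1 since kaldi.fbank expects a (channel, time) tensor.
    mono = multi.mean(dim=0, keepdim=True)

    feats = kaldi.fbank(mono, num_mel_bins=80, sample_frequency=16000.0)
    print(feats.shape)  # (num_frames, 80)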

pzelasko (Collaborator), Feb 23, 2024

You have 2 options:

1. Downmix to mono, which you'd do as a separate explicit step, e.g. via to_mono:

       def to_mono(self, mono_downmix: bool = False) -> Union["DataCut", List["DataCut"]]:

   or via load_audio in lhotse/cut/mixed.py (lines 1020 to 1022 in 769c273):

       def load_audio(
           self, mixed: bool = True, mono_downmix: bool = False
       ) -> Optional[np.ndarray]:

   Another variant would be to apply beamforming or a similar technique instead of just downmixing. (A minimal sketch of the downmix option follows after this comment.)

2. Compute features for each channel separately (if that makes sense for what you're going to do later with this); I think the existing feature extractors don't support that, but we can extend them.

But specifically for microphone arrays, the best idea seems to be to apply beamforming. CHiME-6 used BeamformIt; maybe there are newer/better tools available today.
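A minimal sketch of the downmix option above, assuming a lhotse CutSet workflow; the manifest path is hypothetical, and the to_mono semantics are inferred only from the signature quoted in this comment:

    from lhotse import CutSet

    # Hypothetical manifest containing multi-channel (e.g. AISHELL-4) cuts.
    cuts = CutSet.from_file("aishell4_cuts.jsonl.gz")

    for cut in cuts:
        # Collapse the channels into a single mono cut before feature
        # extraction; mono_downmix=True presumably averages channels
        # (the Union return type above suggests False yields one cut
        # per channel instead).
        if hasattr(cut, "to_mono"):
            cut = cut.to_mono(mono_downmix=True)
        audio = cut.load_audio()  # expected shape: (1, num_samples)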

yuekaizhang (Contributor, Author)

Sorry for the late response, @pzelasko. I have added a warning to address your concern. Would you mind checking it? Thanks.

pzelasko (Collaborator), Mar 7, 2024

Thanks, LGTM!

pzelasko merged commit 7cc8fb4 into lhotse-speech:master on Mar 7, 2024
8 of 11 checks passed