
Conversation

D4ve-R (Contributor) commented Feb 28, 2024

This adds support for the WavLMForAudioFrameClassification and Wav2Vec2ForAudioFrameClassification models.
The models can be used for speaker diarization tasks; a rough usage sketch follows the list below.

"Official model" microsoft/wavlm-base-plus-sd

  • Add AutoModelForAudioFrameClassification
  • Add Wav2Vec2ForAudioFrameClassification
  • Add WavLMForAudioFrameClassification
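
A rough usage sketch of the new classes (not part of this PR; whether ONNX weights exist for this checkpoint and the exact preprocessing call are assumptions):

```js
// Sketch: speaker diarization with the new AutoModelForAudioFrameClassification class.
// Assumes ONNX weights are available for the checkpoint and that the processor
// follows the usual Transformers.js feature-extractor API.
import { AutoProcessor, AutoModelForAudioFrameClassification } from '@xenova/transformers';

const model_id = 'microsoft/wavlm-base-plus-sd';
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await AutoModelForAudioFrameClassification.from_pretrained(model_id);

// `audio` is a Float32Array of mono samples at 16 kHz.
const inputs = await processor(audio);
const { logits } = await model(inputs); // shape: [batch_size, num_frames, num_speakers]
```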

xenova (Collaborator) commented Mar 4, 2024

Thanks for the PR! Could you maybe explain how the output of the audio frame classification model should be interpreted? The example code in the model card produces a one-hot array of shape (num_frames, num_speakers), but it would be nice to be able to turn that into a JSON output with timestamps.

D4ve-R (Contributor, Author) commented Mar 4, 2024

Yes, of course! The output logits have shape [num_batches, num_frames, num_speakers].
I can't tell you more yet, but I'm running experiments, because I had the same idea 😆.
Since there is no speaker-diarization task in transformers, how do you feel about implementing a new pipeline for that task in transformers.js? I think it would be really cool.
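
A minimal sketch of turning those logits into timestamped JSON segments, assuming ~20 ms per frame (the WavLM/Wav2Vec2 feature extractor downsamples 16 kHz audio by a factor of 320), non-overlapping speech, and a flat row-major logits buffer:

```js
// Sketch: collapse per-frame speaker predictions into { speaker, start, end } segments.
// Uses argmax per frame, so overlapping speech is not handled here.
function framesToSegments(logits, numFrames, numSpeakers, frameDuration = 0.02) {
  const segments = [];
  let current = null;
  for (let t = 0; t < numFrames; ++t) {
    // Most likely speaker for this frame.
    let best = 0;
    for (let s = 1; s < numSpeakers; ++s) {
      if (logits[t * numSpeakers + s] > logits[t * numSpeakers + best]) best = s;
    }
    if (current && current.speaker === best) {
      current.end = (t + 1) * frameDuration; // extend the running segment
    } else {
      if (current) segments.push(current);
      current = { speaker: best, start: t * frameDuration, end: (t + 1) * frameDuration };
    }
  }
  if (current) segments.push(current);
  return segments; // e.g. [{ speaker: 0, start: 0, end: 1.24 }, { speaker: 1, ... }]
}
```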

D4ve-R (Contributor, Author) commented Mar 4, 2024

I suspect the model output is not ready "out-of-the-box" for such a pipeline, because of possible overlapping speech, etc. (I still need to check the paper and code; I haven't had time yet.)
I'm working on porting this code for a speaker diarization pipeline to JS, with different models if needed.
It would be really awesome to be able to combine it with Whisper for accurate conversational transcription in the browser 🤯
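
One way the overlap could be handled (a sketch only, not necessarily what the paper or pyannote does): threshold each speaker channel independently instead of taking an argmax, so several speakers can be active in the same frame:

```js
// Sketch: per-speaker activity detection that allows overlapping speech.
// Each channel is thresholded on its sigmoid probability; 0.5 is an arbitrary choice.
function activeSpeakersPerFrame(logits, numFrames, numSpeakers, threshold = 0.5) {
  const sigmoid = (x) => 1 / (1 + Math.exp(-x));
  const active = [];
  for (let t = 0; t < numFrames; ++t) {
    const speakers = [];
    for (let s = 0; s < numSpeakers; ++s) {
      if (sigmoid(logits[t * numSpeakers + s]) >= threshold) speakers.push(s);
    }
    active.push(speakers); // empty = silence, length > 1 = overlapping speakers
  }
  return active;
}
```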

HuggingFaceDocBuilderDev commented

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

xenova merged commit 8eef154 into huggingface:main on Mar 7, 2024
flatsiedatsie (Contributor) commented

@D4ve-R Did you manage to get the model to classify three different speakers? The current implementation in Transformers.js seems to only split audio into two speakers.

D4ve-R (Contributor, Author) commented Sep 6, 2024

@flatsiedatsie Sorry, I can't remember.

Here is a working example that runs speaker diarization with Whisper + pyannote in Transformers.js:

https://huggingface.co/spaces/Xenova/whisper-speaker-diarization/blob/main/whisper-speaker-diarization/src/worker.js

Hope it helps
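
A rough sketch of the kind of post-processing that example does, attaching a speaker label to each Whisper chunk via timestamp overlap (the chunk/segment shapes here are assumptions, not the worker's exact format):

```js
// Sketch: label Whisper chunks with the diarization segment that overlaps them most.
// `chunks` come from a Whisper pipeline run with timestamps enabled and look like
// { text, timestamp: [start, end] }; `segments` are { speaker, start, end } objects.
function assignSpeakers(chunks, segments) {
  return chunks.map((chunk) => {
    const [start, end] = chunk.timestamp;
    let bestSpeaker = null;
    let bestOverlap = 0;
    for (const seg of segments) {
      const overlap = Math.min(end, seg.end) - Math.max(start, seg.start);
      if (overlap > bestOverlap) {
        bestOverlap = overlap;
        bestSpeaker = seg.speaker;
      }
    }
    return { ...chunk, speaker: bestSpeaker };
  });
}
```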

flatsiedatsie (Contributor) commented

Thanks, that's very kind.

I'm currently testing an implementation where I recursively re-segment long segments.
