Description
I'm using a modified version of WhisperX that just returns the speaker embeddings from pyannote.audio so I can match speakers across multiple videos (similar to PR 997).
I've noticed a recurring bad speaker, which is a combination of failures: unidentified switches between speakers, unmic'ed speakers, Zoom-distorted audio, one-word exchanges, etc. However, the failure is very different across videos, or even within a video, and yet the cosine similarity between the embeddings for that failed speaker and other failed speakers is high. I'm guessing these are all just different flavors of noisy/uncertain data, but I was surprised that noise could match other noise in such a high fraction (14/24) of videos.
Doing some simple stats on the embedding arrays, the garbage ones appear to have a large minimum, a small maximum, an especially tight range (max - min), a mean very near zero, and a small standard deviation. I think the range and/or the standard deviation are signatures that could probably be used to identify and filter out noise prior to diarization to improve its quality.
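To make the statistics above concrete, here is a minimal sketch of the kind of check I mean. It uses only the Python standard library and treats an embedding as a plain sequence of floats. The function names (`embedding_stats`, `looks_like_noise`) and the thresholds (`max_range`, `max_std`) are hypothetical and not part of WhisperX or pyannote.audio; the cutoffs would have to be tuned on real embeddings.

```python
import statistics


def embedding_stats(vec):
    """Return simple summary statistics for one embedding vector."""
    lo, hi = min(vec), max(vec)
    return {
        "min": lo,
        "max": hi,
        "range": hi - lo,                 # max - min
        "mean": statistics.fmean(vec),
        "std": statistics.pstdev(vec),    # population standard deviation
    }


def looks_like_noise(vec, max_range=0.5, max_std=0.1):
    """Flag an embedding whose values sit in a tight band.

    The thresholds are placeholders (an assumption for illustration),
    not values measured from the videos in this issue.
    """
    s = embedding_stats(vec)
    return s["range"] < max_range and s["std"] < max_std


# Toy vectors, not real speaker embeddings.
clean = [-1.0, 0.5, 1.0, -0.5]
noisy = [-0.01, 0.02, 0.0, -0.01]
print(looks_like_noise(clean), looks_like_noise(noisy))
```

Requiring both a tight range and a small standard deviation is just one choice; either signature alone might be enough in practice.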
I have 14 different videos that all have a speaker that matches to the same failed speaker, likely because their embedding vectors just occupy a small volume around the origin.
SPEAKER_43 here:
https://medford-transcripts.github.io/2024-10-05_3gvhm0AovZU/2024-10-05_3gvhm0AovZU.html
Matches (cosine similarity > 0.7) garbage in 13/23 other videos (search for "3gvhm0AovZU_SPEAKER_43"):
https://medford-transcripts.github.io/2018-03-20_3N-X2ResFqI/2018-03-20_3N-X2ResFqI.html
https://medford-transcripts.github.io/2020-08-04_n50NtLaAUqY/2020-08-04_n50NtLaAUqY.html
https://medford-transcripts.github.io/2021-05-20_fvIk50DtTTc/2021-05-20_fvIk50DtTTc.html
https://medford-transcripts.github.io/2021-06-07_tZVXN6zzUHw/2021-06-07_tZVXN6zzUHw.html
https://medford-transcripts.github.io/2022-11-16_Azob8X18NRY/2022-11-16_Azob8X18NRY.html
https://medford-transcripts.github.io/2022-11-21_7D6c0Dkkm94/2022-11-21_7D6c0Dkkm94.html
https://medford-transcripts.github.io/2023-01-05_3oP-OTu9DFs/2023-01-05_3oP-OTu9DFs.html
https://medford-transcripts.github.io/2023-02-04_mBOS9fhmkww/2023-02-04_mBOS9fhmkww.html
https://medford-transcripts.github.io/2023-02-06_BSAjmRA8UYk/2023-02-06_BSAjmRA8UYk.html
https://medford-transcripts.github.io/2023-03-02_goLe37yQgNQ/2023-03-02_goLe37yQgNQ.html
https://medford-transcripts.github.io/2023-04-26_Y0_Ezb06bvc/2023-04-26_Y0_Ezb06bvc.html
https://medford-transcripts.github.io/2023-05-23_-GdGrA4wKuQ/2023-05-23_-GdGrA4wKuQ.html
https://medford-transcripts.github.io/2025-01-20_IByfBf6FgY8/2025-01-20_IByfBf6FgY8.html
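For reference, the matching above is roughly equivalent to the following sketch: compute the cosine similarity between one speaker's embedding and every candidate, and keep those above 0.7. The helper names (`cosine_similarity`, `find_matches`), the label format, and the toy vectors are assumptions for illustration, not my actual code or any WhisperX API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # undefined for a zero vector; treat as "no match"
    return dot / (norm_a * norm_b)


def find_matches(query, candidates, threshold=0.7):
    """Return (label, similarity) pairs above the threshold, best first.

    `candidates` maps a label such as "videoid_SPEAKER_NN" to its embedding.
    """
    scored = [(label, cosine_similarity(query, vec))
              for label, vec in candidates.items()]
    matches = [(label, sim) for label, sim in scored if sim > threshold]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# Toy 2-D vectors, not real speaker embeddings.
candidates = {"x": [2.0, 0.0], "y": [0.0, 1.0], "z": [1.0, 1.0]}
print(find_matches([1.0, 0.0], candidates))
```

One thing worth keeping in mind: cosine similarity ignores vector length, so a high score means the garbage embeddings point in a similar direction, not only that they are small.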
You can replace the HTML file in the URL with "model.pkl" (e.g., https://medford-transcripts.github.io/2024-10-05_3gvhm0AovZU/model.pkl) for the transcribed, aligned, and diarized result, and with "embeddings.pkl" (e.g., https://medford-transcripts.github.io/2024-10-05_3gvhm0AovZU/embeddings.pkl) for the speaker embeddings. The HTML file has timestamped links to the underlying YouTube video.
Does this belong in the pyannote.audio issues?