Identify noisy embeddings with standard deviation or range of vector to improve diarization?

I'm using a modified version of WhisperX that just returns the speaker embeddings from pyannote.audio so I can match speakers across multiple videos (similar to PR 997). 

I've noticed a recurring "bad" speaker that results from a combination of failures: undetected speaker changes, un-mic'ed speakers, Zoom-distorted audio, one-word exchanges, etc. The failure mode differs across videos, and even within a video, yet the cosine similarity between the embedding of that failed speaker and those of other failed speakers is high. I'm guessing these are all just different flavors of noisy/uncertain data, but I was surprised that noise could match other noise in such a high fraction (14/24) of videos.

Doing some simple stats on the embedding arrays, the garbage ones appear to have large minima and small maxima (so an especially tight range, max - min), a mean very near zero, and a small standard deviation. I think the range and/or the standard deviation are signatures that could probably be used to identify and filter out noise prior to diarization to improve its quality.
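
To make the idea concrete, here is a minimal sketch of the statistics I mean and of a threshold filter on the standard deviation. It assumes the embeddings are available as a 2-D NumPy array with one row per speaker; the function names, the synthetic data, and the threshold value are illustrative assumptions, not anything that WhisperX or pyannote.audio provides.

```python
import numpy as np

def embedding_stats(embeddings):
    """Per-speaker summary statistics for an array of shape (n_speakers, dim)."""
    emb = np.asarray(embeddings, dtype=float)
    minima = emb.min(axis=1)
    maxima = emb.max(axis=1)
    return {
        "min": minima,
        "max": maxima,
        "range": maxima - minima,  # max - min per embedding
        "mean": emb.mean(axis=1),
        "std": emb.std(axis=1),
    }

def flag_noisy(embeddings, std_threshold):
    """Boolean mask: True where an embedding's std is below the threshold.

    The threshold is a hypothetical tuning parameter; a sensible value
    would have to be chosen by inspecting real embeddings.
    """
    return embedding_stats(embeddings)["std"] < std_threshold

# Synthetic demo only (not real speaker embeddings): three "normal" rows
# and one row squeezed into a small volume around the origin.
rng = np.random.default_rng(0)
normal_rows = rng.normal(0.0, 1.0, size=(3, 256))
tight_row = rng.normal(0.0, 0.05, size=(1, 256))
demo = np.vstack([normal_rows, tight_row])

mask = flag_noisy(demo, std_threshold=0.5)
print(mask)  # only the last (tight) row is flagged
```

The same mask could be built from the "range" entry instead of "std"; which of the two separates garbage from real speakers more cleanly is exactly what I would like to find out.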

I have 14 different videos that each contain a speaker matching the same failed speaker, likely because their embedding vectors just occupy a small volume around the origin.

SPEAKER_43 here:
https://medford-transcripts.github.io/2024-10-05_3gvhm0AovZU/2024-10-05_3gvhm0AovZU.html

It matches garbage speakers (cosine similarity > 0.7) in 13 of the 23 other videos (search each page for "3gvhm0AovZU_SPEAKER_43"):

https://medford-transcripts.github.io/2018-03-20_3N-X2ResFqI/2018-03-20_3N-X2ResFqI.html
https://medford-transcripts.github.io/2020-08-04_n50NtLaAUqY/2020-08-04_n50NtLaAUqY.html
https://medford-transcripts.github.io/2021-05-20_fvIk50DtTTc/2021-05-20_fvIk50DtTTc.html
https://medford-transcripts.github.io/2021-06-07_tZVXN6zzUHw/2021-06-07_tZVXN6zzUHw.html
https://medford-transcripts.github.io/2022-11-16_Azob8X18NRY/2022-11-16_Azob8X18NRY.html
https://medford-transcripts.github.io/2022-11-21_7D6c0Dkkm94/2022-11-21_7D6c0Dkkm94.html
https://medford-transcripts.github.io/2023-01-05_3oP-OTu9DFs/2023-01-05_3oP-OTu9DFs.html
https://medford-transcripts.github.io/2023-02-04_mBOS9fhmkww/2023-02-04_mBOS9fhmkww.html
https://medford-transcripts.github.io/2023-02-06_BSAjmRA8UYk/2023-02-06_BSAjmRA8UYk.html
https://medford-transcripts.github.io/2023-03-02_goLe37yQgNQ/2023-03-02_goLe37yQgNQ.html
https://medford-transcripts.github.io/2023-04-26_Y0_Ezb06bvc/2023-04-26_Y0_Ezb06bvc.html
https://medford-transcripts.github.io/2023-05-23_-GdGrA4wKuQ/2023-05-23_-GdGrA4wKuQ.html
https://medford-transcripts.github.io/2025-01-20_IByfBf6FgY8/2025-01-20_IByfBf6FgY8.html
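
For reference, the cross-video matching above amounts to something like the following sketch. It assumes each video's speaker embeddings are stacked in a 2-D NumPy array (one row per speaker); the function names and the toy vectors are hypothetical, and only the 0.7 threshold comes from what I actually used.

```python
import numpy as np

def cosine_similarity_matrix(a, b, eps=1e-12):
    """Cosine similarity between every row of a and every row of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Normalize rows to unit length; eps guards against all-zero rows.
    a_unit = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), eps)
    b_unit = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), eps)
    return a_unit @ b_unit.T

def matching_pairs(a, b, threshold=0.7):
    """List (row_in_a, row_in_b, similarity) for pairs above the threshold."""
    sim = cosine_similarity_matrix(a, b)
    rows, cols = np.nonzero(sim > threshold)
    return [(int(i), int(j), float(sim[i, j])) for i, j in zip(rows, cols)]

# Toy vectors, not real embeddings.
video_a = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0]])
video_b = np.array([[2.0, 0.1, 0.0],
                    [0.0, 0.0, 3.0]])

pairs = matching_pairs(video_a, video_b)
print(pairs)  # speaker 0 of video_a matches speaker 0 of video_b
```

Note that cosine similarity depends only on direction, so the length of a vector by itself does not change the score.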

You can replace the HTML file name in the URL with "model.pkl" (e.g., https://medford-transcripts.github.io/2024-10-05_3gvhm0AovZU/model.pkl) to get the transcribed, aligned, and diarized result, or with "embeddings.pkl" (e.g., https://medford-transcripts.github.io/2024-10-05_3gvhm0AovZU/embeddings.pkl) to get the speaker embeddings. The HTML file has timestamped links to the underlying YouTube video.
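
If you want to script that substitution, a small sketch using only the standard library is below. The helper name `artifact_urls` is made up for illustration; it just swaps the last path segment of a transcript URL and does not download anything.

```python
from urllib.parse import urljoin

def artifact_urls(html_url):
    """Derive the model.pkl and embeddings.pkl URLs from a transcript page URL."""
    # A relative reference without a slash replaces the last path segment.
    return {
        "model": urljoin(html_url, "model.pkl"),
        "embeddings": urljoin(html_url, "embeddings.pkl"),
    }

page = ("https://medford-transcripts.github.io/"
        "2024-10-05_3gvhm0AovZU/2024-10-05_3gvhm0AovZU.html")
urls = artifact_urls(page)
print(urls["model"])
print(urls["embeddings"])
```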

Does this belong in the pyannote.audio issues?