Identify noisy embeddings with standard deviation or range of vector to improve diarization? #1011

@jdeast

I'm using a modified version of WhisperX that just returns the speaker embeddings from pyannote.audio so I can match speakers across multiple videos (similar to PR 997).

I've noticed a recurring bad speaker, which is a combination of failures: unidentified speaker switches, unmic'ed speakers, Zoom-distorted audio, one-word exchanges, etc. The failure differs a lot across videos, and even within a video, yet the cosine similarity between the embeddings for that failed speaker and other failed speakers is high. I'm guessing these are all just different flavors of noisy/uncertain data, but I was surprised that noise could match other noise in such a high fraction (14/24) of videos.

Doing some simple stats on the embeddings array, the garbage ones appear to have a large minimum, a small maximum, an especially tight range (max - min), an average very near zero, and a small standard deviation. I think the range and/or the standard deviation are signatures that could probably be used to identify and filter out noise prior to diarization to improve its quality.
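As a rough illustration of the stats described above, here is a minimal sketch using NumPy on synthetic data. It is not code from WhisperX or pyannote.audio; the function names `embedding_stats` and `flag_noisy`, the assumed `(n_speakers, dim)` array layout, and the threshold value are all hypothetical and would need tuning on real embeddings.

```python
import numpy as np

def embedding_stats(embeddings):
    """Per-speaker summary statistics for an (n_speakers, dim) array."""
    emb = np.asarray(embeddings, dtype=float)
    lo = emb.min(axis=1)
    hi = emb.max(axis=1)
    return {
        "min": lo,
        "max": hi,
        "range": hi - lo,          # max - min per embedding
        "mean": emb.mean(axis=1),
        "std": emb.std(axis=1),
    }

def flag_noisy(embeddings, std_threshold):
    """Boolean mask, True where the per-embedding std is below the threshold.

    std_threshold is a hypothetical cutoff, not a value taken from any library.
    """
    return embedding_stats(embeddings)["std"] < std_threshold

# Synthetic demo: three "normal" embeddings and one tightly clustered around zero.
rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(3, 256))
noisy = rng.normal(0.0, 0.05, size=(1, 256))
demo = np.vstack([normal, noisy])

mask = flag_noisy(demo, std_threshold=0.5)
stats = embedding_stats(demo)
```

The same mask could be built from `stats["range"]` instead of the standard deviation; which of the two separates garbage from real speakers better is something to check against the actual data.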

I have 14 different videos that each contain a speaker matching the same failed speaker, likely because their embedding vectors just occupy a small volume around the origin.

SPEAKER_43 here:
https://medford-transcripts.github.io/2024-10-05_3gvhm0AovZU/2024-10-05_3gvhm0AovZU.html

Matches (cosine similarity > 0.7) garbage in 13/23 other videos (search for "3gvhm0AovZU_SPEAKER_43"):

https://medford-transcripts.github.io/2018-03-20_3N-X2ResFqI/2018-03-20_3N-X2ResFqI.html
https://medford-transcripts.github.io/2020-08-04_n50NtLaAUqY/2020-08-04_n50NtLaAUqY.html
https://medford-transcripts.github.io/2021-05-20_fvIk50DtTTc/2021-05-20_fvIk50DtTTc.html
https://medford-transcripts.github.io/2021-06-07_tZVXN6zzUHw/2021-06-07_tZVXN6zzUHw.html
https://medford-transcripts.github.io/2022-11-16_Azob8X18NRY/2022-11-16_Azob8X18NRY.html
https://medford-transcripts.github.io/2022-11-21_7D6c0Dkkm94/2022-11-21_7D6c0Dkkm94.html
https://medford-transcripts.github.io/2023-01-05_3oP-OTu9DFs/2023-01-05_3oP-OTu9DFs.html
https://medford-transcripts.github.io/2023-02-04_mBOS9fhmkww/2023-02-04_mBOS9fhmkww.html
https://medford-transcripts.github.io/2023-02-06_BSAjmRA8UYk/2023-02-06_BSAjmRA8UYk.html
https://medford-transcripts.github.io/2023-03-02_goLe37yQgNQ/2023-03-02_goLe37yQgNQ.html
https://medford-transcripts.github.io/2023-04-26_Y0_Ezb06bvc/2023-04-26_Y0_Ezb06bvc.html
https://medford-transcripts.github.io/2023-05-23_-GdGrA4wKuQ/2023-05-23_-GdGrA4wKuQ.html
https://medford-transcripts.github.io/2025-01-20_IByfBf6FgY8/2025-01-20_IByfBf6FgY8.html
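For reference, the kind of cross-video matching used for the list above can be sketched as follows. This is a minimal NumPy example on toy vectors, not the actual matching code; the helper names `cosine_similarity_matrix` and `matches_above` are hypothetical, and only the 0.7 threshold comes from the description above.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Clip the norms to avoid dividing by zero for an all-zero vector.
    a_norm = a / np.clip(np.linalg.norm(a, axis=1, keepdims=True), 1e-12, None)
    b_norm = b / np.clip(np.linalg.norm(b, axis=1, keepdims=True), 1e-12, None)
    return a_norm @ b_norm.T

def matches_above(query, candidates, threshold=0.7):
    """Indices of candidate embeddings whose similarity to query exceeds threshold."""
    query = np.asarray(query, dtype=float)
    sims = cosine_similarity_matrix(query[None, :], candidates)[0]
    return np.flatnonzero(sims > threshold), sims

# Toy demo with 3-dimensional vectors instead of real speaker embeddings.
query = np.array([1.0, 0.0, 0.0])
candidates = np.array([
    [2.0, 0.0, 0.0],   # same direction, different length
    [0.0, 1.0, 0.0],   # orthogonal
    [1.0, 1.0, 0.0],   # 45 degrees away
])
idx, sims = matches_above(query, candidates, threshold=0.7)
```

Cosine similarity depends only on direction, so the first candidate matches the query exactly despite being twice as long.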

You can replace the HTML file in the URL with "model.pkl" (e.g., https://medford-transcripts.github.io/2024-10-05_3gvhm0AovZU/model.pkl) for the transcribed, aligned, and diarized result, and with "embeddings.pkl" (e.g., https://medford-transcripts.github.io/2024-10-05_3gvhm0AovZU/embeddings.pkl) for the speaker embeddings. The HTML file has timestamped links to the underlying YouTube video.

Does this belong in the pyannote.audio issues?
